Introduction to Computer Vision

What is Computer Vision?

Definition

Computer vision (CV) is the area of artificial intelligence that deals with making computers understand visual data: digital images, video streams, or 3D sensor outputs. The goal is to recover useful information—objects, boundaries, motion, depth, text, or activities—and support decisions, measurements, or automation.

Unlike simply storing or displaying pictures, a CV system builds representations (features, regions, embeddings) and applies models (classical geometry, statistical learning, or deep neural networks) to interpret what the pixels mean in context.

Computer vision vs image processing

Image processing

Operates on pixels and signals: filtering, denoising, resizing, color transforms, compression, and enhancement. Output is usually another image or numeric map.

Computer vision

Aims at semantics: detection, recognition, segmentation, tracking, pose, 3D structure, and scene understanding—often combining image processing with learning and geometry.

Short history

1960s–70s: Early work on blocks world, edges, and simple 3D from images.
1990s–2000s: Robust features (e.g. SIFT), stereo, tracking, and statistical methods.
2012 onward: Deep convolutional networks (AlexNet on ImageNet) shifted many benchmarks to end-to-end learning.
Today: Large-scale detection and segmentation, transformers for vision, generative image models, and real-time embedded CV.

Applications

Computer vision appears in many industries:

Quality inspection and robotics
Medical imaging assistance
Autonomous driving and driver assistance
Security, retail analytics, and sports analytics
OCR, document scanning, and accessibility tools
AR/VR and content creation

How this series is organized

Later chapters cover preprocessing, features, segmentation, object detection, tracking, 3D vision, CNNs, generative models, video, applications, tools, and evaluation—aligned with the sidebar. Use the left menu to jump between topics; verify code and library versions on your machine.

Frequently asked questions

It is the study and engineering of systems that interpret visual data—images or video—to extract meaningful information and support tasks like recognition, measurement, or control.

Image processing transforms images; computer vision interprets them (what and where things are, how they move, etc.). In practice, both are often used together.

Face and object recognition, medical imaging, factory inspection, autonomous vehicles, OCR, augmented reality, and video understanding are common examples.

Image processing basics

Pixels and the image as a grid

A digital image is usually a rectangular grid of pixels (picture elements). Each cell holds one or more numeric values that describe color or intensity at that location. If the image is W pixels wide and H pixels tall, there are W × H spatial locations; each location may store one value (grayscale) or several (e.g. red, green, blue).

Think of row index going downward and column index going rightward—this matches how NumPy and OpenCV typically lay out 2D arrays (img[y, x] in Python, not [x, y]).

Toy 4×4 grayscale pattern

Below, each square is one pixel; the number is an 8-bit intensity (0 = black, 255 = white). Real images are much larger, but the idea is identical.

4080120160 60100140180 80120160200 100140180220

Same idea as a NumPy array (values illustrative):

import numpy as np

# Grayscale image: height 4, width 4, one channel
gray = np.array([
    [40,  80, 120, 160],
    [60, 100, 140, 180],
    [80, 120, 160, 200],
    [100, 140, 180, 220],
], dtype=np.uint8)

print(gray.shape)   # (4, 4)  →  rows, cols
print(gray[1, 2])   # row 1, col 2 → 140

Sampling and quantization

Sampling decides how many pixels you use in space: higher width/height means finer detail but more data. Quantization decides how many distinct values each measurement can take. For common 8-bit channels, each channel has 256 levels (0–255). Medical or HDR workflows may use 12–16 bits per channel for smoother intensity steps.

Together, sampling + quantization control quality vs size: a 12 MP photo with 8-bit RGB needs far more storage than a 320×240 thumbnail, and heavy quantization (e.g. posterization) creates visible bands in smooth regions.

Spatial resolution

Often written as width × height (e.g. 1920×1080). More pixels can reveal smaller structures; downsampling loses high-frequency detail.

Bit depth

Bits per channel per pixel. 8-bit is standard for display; scientific imaging may prefer higher depth to avoid clipping dark or bright regions.

Channels, resolution, and aspect ratio

Grayscale vs color

A grayscale image uses a single intensity channel. A typical RGB color image uses three channels (red, green, blue) per pixel. Some images add an alpha channel (RGBA) for transparency. In deep learning, tensors are often shaped (height, width, channels) or (channels, height, width) depending on the framework.

One RGB pixel

At position (y, x), you might read three bytes (R, G, B). Pure red could be (255, 0, 0); mid-gray in RGB is (128, 128, 128) (approximate perceptual gray).

# A single RGB pixel as a 1×1×3 array (H, W, C)
import numpy as np
pixel = np.array([[[255, 128, 0]]], dtype=np.uint8)  # orange-ish
print(pixel.shape)  # (1, 1, 3)

Resolution and aspect ratio

Resolution is the pixel count (or physical dots per inch when printing). Aspect ratio is width:height (e.g. 16:9). Resizing without preserving aspect ratio stretches objects; letterboxing or cropping changes composition—important for detection datasets.

Small examples: OpenCV and Pillow

These snippets assume you have installed opencv-python and Pillow in your environment. Replace "photo.jpg" with a real path on your disk.

OpenCV: shape, dtype, one pixel

OpenCV loads color images in BGR order by default (blue, green, red)—not RGB. That matters when you interpret channel indices or mix libraries.

import cv2

img = cv2.imread("photo.jpg")  # None if file missing
if img is not None:
    print(img.shape)   # (H, W) or (H, W, 3)
    print(img.dtype)   # usually uint8
    y, x = 10, 20
    b, g, r = img[y, x].tolist()
    print("BGR at (y,x):", b, g, r)

Convert BGR → RGB for libraries that expect RGB (e.g. Matplotlib): rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).

Pillow: size and mode

from PIL import Image

im = Image.open("photo.jpg")
print(im.size)    # (W, H)
print(im.mode)    # "RGB", "L" (grayscale), "RGBA", ...

gray = im.convert("L")
pixels = list(gray.getdata())  # flat sequence — use for tiny images only

For large images, prefer working with NumPy arrays via np.array(im) instead of materializing every pixel in a Python list.

File formats and memory (brief)

Lossless formats (e.g. PNG) preserve exact pixel values; lossy formats (e.g. JPEG) discard information to shrink file size—artifacts appear near edges and in skies. For training data, avoid re-saving labeled images repeatedly as JPEG if precision matters.

Raw image memory (order of magnitude): an H × W RGB uint8 image uses about H × W × 3 bytes before compression. A 3000×2000 RGB frame is roughly 18 MB in memory as a dense array.

                    Takeaways
                    Images are grids of pixels; indices are usually (row, col) or (y, x) in Python array APIs.
Grayscale = 1 channel; color often = 3 (RGB or BGR in OpenCV).
Sampling sets spatial detail; quantization sets intensity/color precision (often 8 bits per channel).
Always confirm channel order (BGR vs RGB) when mixing OpenCV with other tools.

                

Quick FAQ

Historical reasons and compatibility with older camera and display APIs. In code, convert explicitly when a library expects RGB.

Unsigned 8-bit integer: whole numbers from 0 to 255. It is the most common storage type for 8-bit display images in NumPy and OpenCV.

In OpenCV/NumPy shape, you get (height, width) or (height, width, channels). PIL’s Image.size returns (width, height). Read the docs for each API.

Chapter FAQ

Frequently asked questions

It is the study and engineering of systems that interpret visual data—images or video—to extract meaningful information and support tasks like recognition, measurement, or control.

Image processing transforms images; computer vision interprets them (what and where things are, how they move, etc.). In practice, both are often used together.

Face and object recognition, medical imaging, factory inspection, autonomous vehicles, OCR, augmented reality, and video understanding are common examples.

Quick FAQ

Historical reasons and compatibility with older camera and display APIs. In code, convert explicitly when a library expects RGB.

Unsigned 8-bit integer: whole numbers from 0 to 255. It is the most common storage type for 8-bit display images in NumPy and OpenCV.

In OpenCV/NumPy shape, you get (height, width) or (height, width, channels). PIL’s Image.size returns (width, height). Read the docs for each API.

What is Computer Vision?

Definition

Computer vision vs image processing

Image processing

Computer vision

Short history

Applications

How this series is organized

Frequently asked questions

What is computer vision?

How is it different from image processing?

What are typical applications?

Image processing basics

Pixels and the image as a grid

Toy 4×4 grayscale pattern

Sampling and quantization

Spatial resolution

Bit depth

Channels, resolution, and aspect ratio

Grayscale vs color

One RGB pixel

Resolution and aspect ratio

Small examples: OpenCV and Pillow

OpenCV: shape, dtype, one pixel

Pillow: size and mode

File formats and memory (brief)

Takeaways

Quick FAQ

Why does OpenCV use BGR?

What does uint8 mean?

Are width and height always in that order?

Chapter FAQ

Frequently asked questions

What is computer vision?

How is it different from image processing?

What are typical applications?

Quick FAQ

Why does OpenCV use BGR?

What does uint8 mean?

Are width and height always in that order?