CNN Basics for Vision MCQ
Convolutional neural networks for images and the AlexNet breakthrough on ImageNet.
CNNs for Vision MCQ
Convolutional networks for images
CNNs apply learned filters locally across the spatial grid and share the same parameters at every location, which makes convolution translation-equivariant: shifting the input shifts the feature map by the same amount. Stacked conv layers build hierarchical features, while pooling and strided convolution reduce resolution. Deeper designs, including the backbones used for detection and segmentation, add normalization layers and skip connections.
Parameter sharing
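One conv kernel is reused at every spatial position, so a conv layer's parameter count depends on kernel size and channel counts rather than on image size. That is far fewer parameters than a fully connected layer on the full image.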
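To make the saving concrete, the sketch below counts parameters for a conv layer and for a fully connected layer that maps the same input to the same output volume. The 32×32×3 input and 16 output channels are hypothetical sizes chosen for illustration, not values from these notes.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel, for a square kernel."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


def fc_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return out_features * (in_features + 1)


# Hypothetical sizes: a 32x32 RGB image mapped to 16 feature maps of 32x32.
height = width = 32
c_in, c_out = 3, 16

conv_count = conv_params(c_in, c_out, kernel_size=3)
fc_count = fc_params(height * width * c_in, height * width * c_out)

print(f"conv 3x3: {conv_count:,} parameters")
print(f"fully connected: {fc_count:,} parameters")
```

Doubling the image side leaves the conv count unchanged, while the fully connected count grows roughly sixteenfold.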
Key ideas
Convolution
Sliding inner product: each output channel is a weighted sum over a local neighborhood spanning all input channels.
Pooling
Max or average pool reduces spatial size and adds local translation tolerance.
Stride & padding
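Stride > 1 downsamples; padding preserves spatial size or aligns dimensions. For input size n, kernel k, padding p, and stride s, the output size is floor((n + 2p - k) / s) + 1.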
Receptive field
Region in the input that can influence one output neuron. It grows with depth, kernel size, and the strides of earlier layers.
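The convolution, padding, stride, and pooling ideas above can be sketched in plain Python. This is a deliberately slow, loop-based illustration rather than how any library implements it; like most deep learning frameworks it computes cross-correlation (no kernel flip), and the 4×4 input and all-ones kernel are made-up example data.

```python
def conv2d(x, weights, stride=1, padding=0):
    """x: [C_in][H][W]; weights: [C_out][C_in][K][K] -> [C_out][H_out][W_out]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(weights[0][0])
    # Zero-pad every input channel on all four sides.
    padded = [
        [[0.0] * (w + 2 * padding) for _ in range(padding)]
        + [[0.0] * padding + list(row) + [0.0] * padding for row in channel]
        + [[0.0] * (w + 2 * padding) for _ in range(padding)]
        for channel in x
    ]
    h_out = (h + 2 * padding - k) // stride + 1
    w_out = (w + 2 * padding - k) // stride + 1
    out = []
    for kernel in weights:  # one kernel per output channel
        plane = []
        for i in range(h_out):
            row = []
            for j in range(w_out):
                acc = 0.0
                # Inner product over the local window and all input channels.
                for c in range(c_in):
                    for di in range(k):
                        for dj in range(k):
                            acc += kernel[c][di][dj] * padded[c][i * stride + di][j * stride + dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def max_pool2d(plane, size=2, stride=2):
    """Max over each size x size window of a single [H][W] feature map."""
    h, w = len(plane), len(plane[0])
    return [
        [max(plane[i + di][j + dj] for di in range(size) for dj in range(size))
         for j in range(0, w - size + 1, stride)]
        for i in range(0, h - size + 1, stride)
    ]


# Made-up example: one 4x4 input channel holding 0..15, one 3x3 all-ones kernel.
image = [[[float(4 * r + c) for c in range(4)] for r in range(4)]]
box_kernel = [[[[1.0] * 3 for _ in range(3)]]]

feature_maps = conv2d(image, box_kernel, stride=1, padding=1)  # padding=1 keeps 4x4
pooled = max_pool2d(feature_maps[0])                           # 4x4 -> 2x2
print(pooled)
```

With padding 1 the 3×3 kernel preserves the 4×4 size, and the 2×2 pool with stride 2 then halves each dimension.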
Typical CNN stack
Conv → activation → pool → … → global pool / FC → task head
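As a rough illustration of how spatial size shrinks and the receptive field grows through such a stack, the sketch below traces both quantities layer by layer. The four-layer stack and the 32-pixel input are hypothetical, and the receptive-field recurrence assumes square kernels with no dilation.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """floor((size + 2*padding - kernel) / stride) + 1"""
    return (size + 2 * padding - kernel) // stride + 1


def trace(input_size, layers):
    """layers: list of (name, kernel, stride, padding).

    Returns a list of (name, output_size, receptive_field) tuples.
    """
    size, receptive_field, jump = input_size, 1, 1
    rows = []
    for name, kernel, stride, padding in layers:
        size = conv_output_size(size, kernel, stride, padding)
        # Each layer widens the field by (kernel - 1) steps, where one step
        # spans `jump` input pixels (the product of all earlier strides).
        receptive_field += (kernel - 1) * jump
        jump *= stride
        rows.append((name, size, receptive_field))
    return rows


# Hypothetical stack: two conv 3x3 (padding 1) layers, each followed by a 2x2 pool.
stack = [
    ("conv1", 3, 1, 1),
    ("pool1", 2, 2, 0),
    ("conv2", 3, 1, 1),
    ("pool2", 2, 2, 0),
]

rows = trace(32, stack)
for name, size, rf in rows:
    print(f"{name}: {size}x{size} output, receptive field {rf}x{rf}")
```

Activations are omitted from the trace because elementwise functions change neither the spatial size nor the receptive field.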
AlexNet MCQ
AlexNet in context
AlexNet (Krizhevsky et al., 2012) won the ILSVRC 2012 ImageNet classification challenge with a large GPU-trained CNN. It popularized ReLU activations, dropout regularization, overlapping max pooling, data augmentation, and multi-GPU model parallelism for vision. Deeper stacks of conv layers followed (VGG, ResNet, …).
Why it mattered
It showed that deep CNNs scaled with data and compute could decisively beat hand-crafted features on a hard benchmark: its top-5 test error was about 15.3%, against about 26.2% for the runner-up.
Key ideas
Architecture
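Five conv layers then three FC layers. Local response normalization (LRN) follows the first two conv layers, and overlapping max pooling follows conv layers 1, 2, and 5.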
ReLU
Non-saturating max(0, x) activation; trains faster than saturating sigmoid/tanh units and helps deep nets converge.
Dropout
Randomly zeroes activations (probability 0.5) in the first two FC layers during training to reduce co-adaptation / overfitting.
Scale
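Trained on two GPUs (GTX 580, 3 GB each) with each layer's kernels split between them and cross-GPU communication only at certain layers. This allowed a wider network than a single GPU's memory could hold.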
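The ReLU and dropout entries above can be sketched in a few lines. Note one assumption: this uses "inverted" dropout, the common modern variant that rescales surviving activations during training, whereas the AlexNet paper instead multiplied outputs by 0.5 at test time. The function names and example values are illustrative only.

```python
import random


def relu(values):
    """Non-saturating activation: max(0, v) applied elementwise."""
    return [max(0.0, v) for v in values]


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout (an assumed variant, not the paper's exact scheme).

    During training each activation is zeroed with probability p and the
    survivors are scaled by 1 / (1 - p); at evaluation time the input
    passes through unchanged.
    """
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


# Illustrative values only.
activations = relu([-1.0, 0.5, 2.0, -3.0])
dropped = dropout(activations, p=0.5, rng=random.Random(0))
unchanged = dropout(activations, training=False)

print(activations)
print(dropped)
print(unchanged)
```

Scaling during training keeps the expected activation the same in both modes, so the evaluation path needs no extra factor.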
Rough data flow
227×227 input (the paper states 224×224, but 227 is the size that makes the layer arithmetic work) → conv/pool stages → 4096-4096-1000 FC → softmax
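The data flow can be checked by tracing spatial sizes through the conv and pool stages. The kernel sizes, strides, and channel counts below follow the commonly cited single-tower description of AlexNet (channel counts summed across the two GPUs); LRN and ReLU are left out because they do not change shapes. Treat it as a sketch for checking arithmetic, not a full model definition.

```python
def out_size(size, kernel, stride=1, padding=0):
    """floor((size + 2*padding - kernel) / stride) + 1"""
    return (size + 2 * padding - kernel) // stride + 1


# (name, output channels, kernel, stride, padding)
ALEXNET_STAGES = [
    ("conv1", 96, 11, 4, 0),
    ("pool1", 96, 3, 2, 0),   # overlapping pool: 3x3 window, stride 2
    ("conv2", 256, 5, 1, 2),
    ("pool2", 256, 3, 2, 0),
    ("conv3", 384, 3, 1, 1),
    ("conv4", 384, 3, 1, 1),
    ("conv5", 256, 3, 1, 1),
    ("pool5", 256, 3, 2, 0),
]

size = 227
shapes = []
for name, channels, kernel, stride, padding in ALEXNET_STAGES:
    size = out_size(size, kernel, stride, padding)
    shapes.append((name, channels, size))
    print(f"{name}: {channels} x {size} x {size}")

# Flattened input to the first FC layer, then the 4096-4096-1000 head.
flattened = channels * size * size
fc_dims = [flattened, 4096, 4096, 1000]
print(" -> ".join(str(d) for d in fc_dims))
```

With a 224×224 input and no padding, conv1 would produce 54×54 rather than 55×55, which is why 227 is usually quoted.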