How a CNN Sees and Understands an Image
The journey begins with raw pixel data — a 3D tensor of numbers representing height, width, and color channels (RGB).
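As a concrete illustration, here is a minimal sketch of that representation using Pillow and NumPy; the filename photo.jpg is a hypothetical placeholder:

```python
import numpy as np
from PIL import Image

# Load an image as a 3D tensor of shape (height, width, channels).
# "photo.jpg" is a hypothetical placeholder path.
img = np.asarray(Image.open("photo.jpg").convert("RGB"))
print(img.shape)  # e.g. (224, 224, 3): height, width, RGB channels
print(img.dtype)  # uint8: each value is an intensity in [0, 255]
```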
A kernel spans the full channel depth and slides across the two spatial dimensions of the input, computing a weighted sum at each position. The result is a feature map that highlights where a specific pattern occurs.
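A minimal NumPy sketch of this sliding-window computation; the Sobel-style edge kernel and the toy 8x8 input are illustrative assumptions, and, like most deep learning frameworks, it computes cross-correlation (no kernel flip):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' convolution: slide a (kh, kw, C) kernel over an
    (H, W, C) image and compute a weighted sum at each position."""
    H, W, C = image.shape
    kh, kw, _ = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw, :]
            out[i, j] = np.sum(patch * kernel)  # weighted sum over all channels
    return out

# A 3x3 vertical-edge detector, replicated across the 3 RGB channels.
edge = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
kernel = np.stack([edge] * 3, axis=-1)

image = np.random.rand(8, 8, 3)           # toy RGB input
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)                  # (6, 6): one activation per position
```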
The first layer learns to detect simple, low-level features: edges, corners, and basic gradients.
The second layer combines edges to recognize textures and repeating motifs — surfaces, fabrics, natural patterns.
The third layer assembles textures into shapes and identifiable parts of objects: peaks, tree outlines, architectural forms.
The deepest convolutional layer recognizes entire objects and large-scale structures in the image.
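Stacking convolution and pooling stages is what produces this hierarchy: each deeper stage sees a larger region of the original image. A minimal Keras sketch; the filter counts are illustrative assumptions, and the comments map each stage to the features described above:

```python
from tensorflow.keras import layers, models

# Four conv/pool stages; filter counts (32 -> 256) are illustrative.
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu", padding="same"),   # edges, gradients
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", padding="same"),   # textures, motifs
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu", padding="same"),  # shapes, object parts
    layers.MaxPooling2D(),
    layers.Conv2D(256, 3, activation="relu", padding="same"),  # whole objects
])
model.summary()
```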
Global average pooling collapses each feature map into a single value — summarizing spatial information into a compact vector.
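In NumPy terms, global average pooling is just a mean over the two spatial axes; the (1, 7, 7, 512) shape below is an illustrative assumption:

```python
import numpy as np

# Toy stack of feature maps: batch of 1, 7x7 spatial grid, 512 channels.
feature_maps = np.random.rand(1, 7, 7, 512)

# Global average pooling: average over height and width.
pooled = feature_maps.mean(axis=(1, 2))
print(pooled.shape)  # (1, 512): one summary value per feature map
```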
The feature vector is passed through fully connected layers, then softmax converts raw scores into a probability distribution over all classes.
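A minimal Keras sketch of such a head; the 512-dimensional input, the hidden width, and the 10-class output are illustrative assumptions:

```python
from tensorflow.keras import layers, models

num_classes = 10  # hypothetical number of target classes

head = models.Sequential([
    layers.Input(shape=(512,)),                       # pooled feature vector
    layers.Dense(256, activation="relu"),             # fully connected layer
    layers.Dense(num_classes, activation="softmax"),  # scores -> probabilities
])

# Softmax guarantees the outputs are non-negative and sum to 1.
```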
The hierarchical nature of feature learning in CNNs
Early layers learn simple, universal patterns (edges, textures) that appear in any image domain.
Deeper layers learn task-specific, complex concepts — from shapes to full objects and scenes.
Global pooling discards location; fully connected layers map features to meaning (class probabilities).
Reusing pretrained knowledge for new tasks
Case study: VGG16, pretrained on ImageNet, fine-tuned for melanoma detection.
Early layers learn universal features (edges, textures, shapes) that are useful across ALL image domains. No need to relearn them.
Freeze conv blocks 1-3 (universal features). Fine-tune blocks 4-5 to adapt to domain-specific patterns. Replace the classification head entirely; see the code sketch below.
Less training data needed. Faster convergence. Better generalization. Avoids overfitting on small datasets — critical for medical imaging.
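A minimal Keras sketch of this recipe. The layer-name prefixes follow the stock VGG16 naming (block1_* through block5_*); the head sizes, learning rate, and binary melanoma-vs-benign output are illustrative assumptions:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16
from tensorflow.keras.optimizers import Adam

# Load the VGG16 conv base pretrained on ImageNet, without its classifier.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze blocks 1-3 (universal features); leave blocks 4-5 trainable.
for layer in base.layers:
    layer.trainable = layer.name.startswith(("block4", "block5"))

# Replace the classification head entirely (sizes are illustrative).
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # melanoma vs. benign
])

# A small learning rate keeps fine-tuning from overwriting pretrained weights.
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=["accuracy"])
```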