
From Pixels to Probabilities

How a CNN Sees and Understands an Image

1. Input Image (Pixels)

The journey begins with raw pixel data — a 3D tensor of numbers representing height, width, and color channels (RGB).

What the network receives

  • A grid of numbers from 0 to 255 (or 0 to 1 after normalization)
  • Three channels: Red, Green, Blue
  • Spatial structure: pixels next to each other are related
  • Typical size: 224x224x3 for classification
Torres del Paine, Patagonia — 224x224x3 tensor
"The network sees numbers, not pictures — every pixel is just a number with no inherent meaning."
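What "a 3D tensor of numbers" looks like in code — a minimal sketch using a synthetic NumPy array in place of a real image loader (in practice the pixels would come from a library such as PIL):

```python
import numpy as np

# Synthetic 224x224 RGB image; a real loader would produce the same
# shape and dtype from a photo file.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

print(image.shape)   # (224, 224, 3): height, width, RGB channels
print(image[0, 0])   # one pixel is just three numbers: [R G B]

# Normalize to [0, 1] before feeding the network.
x = image.astype(np.float32) / 255.0
print(x.min(), x.max())
```

Note that nothing in this array "knows" it is a picture — the network must learn all structure from the numbers alone.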

How Convolution Works

A kernel slides across the input spatially, computing a weighted sum over all channels at each position. The result is a feature map that highlights where a specific pattern appears.

Output Feature Map
Each cell shows the result of the kernel applied at that position — highlighting where the pattern was found
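The sliding-window computation can be sketched in a few lines of NumPy — a minimal valid cross-correlation (what deep-learning frameworks call "convolution"), using a Sobel-style vertical-edge kernel as the example pattern:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation: slide the kernel, take weighted sums."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Tiny image: dark left half (0), bright right half (1).
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# Sobel-style kernel: responds where brightness changes left-to-right.
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

feature_map = conv2d(image, kernel)
print(feature_map)  # every row is [0. 4. 4.]: the kernel fires at the edge
```

The output is largest exactly where the dark-to-bright boundary sits — the feature map answers "where is my pattern?" for this one kernel; a conv layer learns many such kernels in parallel.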

2. Convolution Layer 1 (Edges)

The first layer learns to detect simple, low-level features: edges, corners, and basic gradients.

What is learned at this stage

  • Horizontal and vertical edges
  • Diagonal lines and corners
  • Simple color contrasts
  • Basic gradient directions
"Each filter responds to one specific pattern — the network learns which patterns are useful."

3. Convolution Layer 2 (Textures)

The second layer combines edges to recognize textures and repeating motifs — surfaces, fabrics, natural patterns.

What is learned at this stage

  • Rock surfaces, grass, water textures
  • Repeating patterns and motifs
  • Surface roughness and smoothness
  • Regular vs. irregular structures
"Deeper layers learn complex patterns by combining simpler ones — edges become textures."

4. Convolution Layer 3 (Shapes & Parts)

This layer recognizes shapes and identifiable parts of objects — peaks, tree outlines, architectural forms.

What is learned at this stage

  • Geometric shapes (circles, triangles, rectangles)
  • Object parts (wheels, eyes, mountain peaks)
  • Spatial combinations of textures
  • Boundary contours of objects
"Parts of objects — not whole objects yet. The network now sees structure, not just pattern."

5. Convolution Layer 4 (Objects)

The deepest convolutional layer recognizes entire objects and large-scale structures in the image.

What is learned at this stage

  • Full objects (mountains, trees, buildings, animals)
  • Scene composition and layout
  • Object relationships and context
  • High-level semantic understanding
[Illustration: regions of the image activating as SKY, MOUNTAIN, TREE, WATER]
"The network now understands 'what' is in the image — full semantic comprehension."

6. Global Pooling (High-Level Features)

Global average pooling collapses each feature map into a single value — summarizing spatial information into a compact vector.

What happens at this stage

  • Each feature map (e.g., 7x7) becomes a single number
  • 512 feature maps become a 512-dimensional vector
  • Spatial location is discarded
  • Only "what" remains, not "where"
"Everything the network knows about the image, compressed into one compact vector."
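The whole stage is a single averaging operation — a sketch in NumPy, assuming a hypothetical 7x7x512 activation volume (channels-last layout):

```python
import numpy as np

# Output of the last conv stage: 512 feature maps, each 7x7.
features = np.random.rand(7, 7, 512)

# Global average pooling: collapse each 7x7 map to one number
# by averaging over both spatial axes.
vector = features.mean(axis=(0, 1))

print(vector.shape)  # (512,): spatial "where" is gone, only "what" remains
```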

7. Fully Connected + Softmax (Probabilities)

The feature vector is passed through fully connected layers, then softmax converts raw scores into a probability distribution over all classes.

What happens at this stage

  • Feature vector becomes class scores (logits)
  • Softmax: e^score / sum of all e^scores
  • All probabilities sum to 1.0
  • Highest probability = model's prediction
  • Mountain landscape: 82%
  • Alpine lake: 8%
  • Forest landscape: 5%
  • Coastal landscape: 3%
  • Urban landscape: 2%
"Probabilities sum to 1.0 — the highest probability is the model's prediction: Mountain landscape (82%)."
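The softmax step is easy to reproduce directly. The logits below are hypothetical values chosen to roughly match the percentages shown above:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical class scores for the five landscape classes.
logits = np.array([4.1, 1.8, 1.3, 0.8, 0.4])
probs = softmax(logits)

print(probs.round(2))  # [0.82 0.08 0.05 0.03 0.02]
print(probs.sum())     # ~1.0 by construction
```

Exponentiating makes every score positive, and dividing by the total forces the outputs to sum to 1 — turning arbitrary scores into a valid probability distribution.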

Key Takeaways

The hierarchical nature of feature learning in CNNs

Simple to Complex

Early layers learn simple, universal patterns (edges, textures) that appear in any image domain.

Local to Global

Deeper layers see ever-larger regions of the image: local patterns combine into shapes, whole objects, and finally the layout of the entire scene.

Spatial to Semantic

Global pooling discards location; fully-connected layers map features to meaning (class probabilities).

Transfer Learning

Reusing pretrained knowledge for new tasks

VGG16 — Pretrained on ImageNet
1000 classes · 1.2M images

  • Conv Block 1 — Edges
  • Conv Block 2 — Textures
  • Conv Block 3 — Shapes
  • Conv Block 4 — Parts
  • Conv Block 5 — Objects
  • FC Layers (4096)
  • Softmax (1000 classes)

Surgery: keep the conv blocks, replace the classification head

Fine-tuned for Melanoma Detection
2 classes · 10K images

  • Conv Block 1 — Edges
  • Conv Block 2 — Textures
  • Conv Block 3 — Shapes
  • Conv Block 4 — Skin Patterns
  • Conv Block 5 — Lesion Features
  • New FC Layer (256)
  • Softmax (Benign / Malignant)

Why It Works

Early layers learn universal features (edges, textures, shapes) that are useful across ALL image domains. No need to relearn them.

What to Freeze

Freeze conv blocks 1-3 (universal features). Fine-tune blocks 4-5 to adapt to domain-specific patterns. Replace the classification head entirely.

Benefits

Less training data needed. Faster convergence. Better generalization. Avoids overfitting on small datasets — critical for medical imaging.