Topic 14 · Phase 3 · Interactive

Learn by Adjusting Weights

Neural Networks — stacked layers of weighted sums and non-linearities

1 · The Problem

A radiologist needs software that can spot tumours in X-rays. The features that distinguish tumour from tissue are complex, non-linear, and interact in ways no human can hand-engineer. Traditional ML needs those features given to it — neural networks learn the features themselves.

Raw pixels (256×256 = 65,536 inputs) → Hidden layers (learned features: edges, curves, shapes) → Prediction (tumour: 87% confidence)
"A neural network is just a function approximator. Given enough neurons, it can approximate any continuous function — this is the Universal Approximation Theorem."

2 · The Intuition

Each neuron computes a weighted sum of its inputs, adds a bias, then applies a non-linear activation. Stack several layers: the first layer detects simple patterns, the next combines them into complex ones. The network learns by backpropagating the error gradient from output to input, nudging weights in the direction that reduces error.

1. Forward pass: inputs flow left to right; each layer transforms the signal.
2. Loss: compare the prediction to the label and compute the error (MSE or cross-entropy).
3. Backprop: the chain rule propagates ∂Loss/∂w through every layer.
4. Gradient descent: weights are updated as w ← w − η · ∂Loss/∂w.
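
A minimal sketch of one training step for a tiny 2→3→1 network in NumPy, with the four steps above labelled in comments. The architecture, toy data, and learning rate are illustrative assumptions, not taken from the original.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and a 2 -> 3 -> 1 network (all sizes and values are illustrative).
X = rng.normal(size=(8, 2))                    # 8 examples, 2 features
y = (X[:, :1] * X[:, 1:] > 0).astype(float)    # a non-linear toy target
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
eta = 0.1                                      # learning rate

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 1. Forward pass: inputs flow left to right through each layer.
z1 = X @ W1 + b1
a1 = np.maximum(0.0, z1)          # ReLU hidden layer
z2 = a1 @ W2 + b2
a2 = sigmoid(z2)                  # prediction in (0, 1)

# 2. Loss: compare prediction to label (binary cross-entropy here).
loss = -np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2))

# 3. Backprop: the chain rule carries dLoss/dw back through every layer.
m = X.shape[0]
dz2 = (a2 - y) / m                # gradient of BCE + sigmoid w.r.t. z2
dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
dz1 = (dz2 @ W2.T) * (z1 > 0)     # ReLU passes gradient only where z1 > 0
dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

# 4. Gradient descent: w <- w - eta * dLoss/dw
W2, b2 = W2 - eta * dW2, b2 - eta * db2
W1, b1 = W1 - eta * dW1, b1 - eta * db1
```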

3 · The Math

Single Neuron
z = w·x + b  →  a = σ(z)

σ is the activation function. Common choices: ReLU = max(0,z), sigmoid = 1/(1+e⁻ᶻ), tanh.
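
A small sketch of the single-neuron computation with the three activations named above, using NumPy; the input, weight, and bias values are made up.

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])      # inputs (illustrative values)
w = np.array([0.8, 0.1, -0.4])      # weights
b = 0.2                             # bias

z = w @ x + b                       # weighted sum plus bias
relu_a    = max(0.0, z)             # ReLU    = max(0, z)
sigmoid_a = 1.0 / (1.0 + np.exp(-z))  # sigmoid = 1 / (1 + e^-z)
tanh_a    = np.tanh(z)              # tanh squashes to (-1, 1)
```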

Chain Rule (Backprop)
∂L/∂wₗ = ∂L/∂aₗ · ∂aₗ/∂zₗ · ∂zₗ/∂wₗ

Gradient flows backward; each layer gets its fair share of blame for the error.
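
To make the chain rule concrete, here is a sketch of a gradient check on one sigmoid neuron with squared loss: the analytic gradient ∂L/∂w = (a − y) · σ′(z) · x, using σ′(z) = a(1 − a), should match a finite-difference estimate. All values are illustrative.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x, y = np.array([1.5, -0.3]), 1.0       # one example and its label (made up)
w, b = np.array([0.4, 0.7]), 0.1

def loss(w):
    a = sigmoid(w @ x + b)
    return 0.5 * (a - y) ** 2           # squared loss

# Chain rule: dL/dw = dL/da * da/dz * dz/dw = (a - y) * a*(1 - a) * x
z = w @ x + b
a = sigmoid(z)
grad_analytic = (a - y) * a * (1 - a) * x

# Finite-difference check of the same gradient.
eps = 1e-6
grad_numeric = np.array([
    (loss(w + eps * np.eye(2)[i]) - loss(w - eps * np.eye(2)[i])) / (2 * eps)
    for i in range(2)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))   # True
```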

SGD Weight Update
w ← w − η · (1/m) Σ ∂L/∂w

η is the learning rate (typically 0.001–0.01). Adam adapts the effective step size per parameter.
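
A sketch of the plain SGD update next to a hand-rolled Adam step, following the standard Adam formulas; the gradient values and hyperparameters here are illustrative.

```python
import numpy as np

w = np.array([0.5, -0.2])
g = np.array([0.1, 0.03])        # (1/m) * sum dL/dw, made up for illustration

# Plain SGD: one learning rate eta shared by every parameter.
eta = 0.01
w_sgd = w - eta * g

# Adam: running means of the gradient (m) and its square (v) give each
# parameter its own effective step size.
beta1, beta2, eps, t = 0.9, 0.999, 1e-8, 1
m = np.zeros_like(w)
v = np.zeros_like(w)
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g ** 2
m_hat = m / (1 - beta1 ** t)     # bias correction for the first steps
v_hat = v / (1 - beta2 ** t)
w_adam = w - 1e-3 * m_hat / (np.sqrt(v_hat) + eps)
```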

Interactive Forward Pass

Drag the sliders to set two input values. Watch activations propagate through a 3-layer network (2→3→3→1). Brighter nodes = higher activation. The output is a sigmoid — drag inputs to see it tip toward 0 or 1.
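
The forward pass the widget animates can be sketched in a few lines of NumPy: a 2→3→3→1 network with a sigmoid output. The weights are random placeholders and the hidden layers are assumed to use ReLU (the page does not specify), so the exact activations will differ from the widget's.

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Random placeholder weights for a 2 -> 3 -> 3 -> 1 network.
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)
W3, b3 = rng.normal(size=(3, 1)), np.zeros(1)

x = np.array([0.5, 0.5])                 # the two slider values

a1 = np.maximum(0.0, x @ W1 + b1)        # hidden layer 1 (ReLU)
a2 = np.maximum(0.0, a1 @ W2 + b2)       # hidden layer 2 (ReLU)
out = sigmoid(a2 @ W3 + b3)              # output in (0, 1)

print(a1, a2, out)   # "brighter nodes" correspond to larger activations
```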

4 · Assumptions & Pitfalls

Vanishing gradients. In deep sigmoid networks, gradients shrink to near zero in early layers — they stop learning. Fix: use ReLU activations or batch normalisation.
Overfitting on small data. A large network can memorise training data perfectly. Use dropout, L2 regularisation, and early stopping.
Hyperparameter sensitivity. Learning rate is the most critical. Too high = divergence. Too low = slow convergence. Use Adam with lr=1e-3 as a safe default.
Data requirements. Deep networks need large labelled datasets. With <1000 examples, tree-based methods or SVMs often win.
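
As one example of the regularisers mentioned above, here is a sketch of inverted dropout on a hidden activation in NumPy; the keep probability and activation values are illustrative, and the mask is applied only during training.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(size=(4, 8))       # hidden-layer activations (made up)
keep_prob = 0.8                    # keep 80% of units on this pass

# Inverted dropout: zero out units at random, then rescale so the expected
# activation is unchanged; at test time the layer is left untouched.
mask = (rng.uniform(size=a.shape) < keep_prob) / keep_prob
a_train = a * mask
a_test = a                         # no dropout at inference
```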

5 · When to Use

Strengths

  • Learns features automatically from raw data
  • Scales with data and compute
  • State-of-the-art on images, text, audio
  • Transfer learning: fine-tune pre-trained models

Limitations

  • Needs large labelled datasets
  • Compute intensive to train
  • Black box — low interpretability
  • Overkill for structured/tabular data

Typical applications: image recognition, natural language processing, speech recognition, recommendation systems, game playing (RL).

Key Takeaways

ReLU + Adam is the Default

Start with ReLU activations and Adam optimiser (lr=0.001). This combination works well across a wide range of architectures.

Backprop is Just the Chain Rule

The magic of deep learning is that gradients chain together. Automatic differentiation handles this — you define the forward pass; the framework handles the rest.

Transfer Learning First

Don't train from scratch if a pre-trained model exists. Fine-tune the last layers on your data: roughly 10× faster, with far less data required.
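
A hedged sketch of that fine-tuning recipe, assuming PyTorch and torchvision (neither is named in the original) and a ResNet-18 backbone chosen purely as an example: load pre-trained weights, freeze the feature extractor, and train only a new final layer.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a pre-trained backbone (ResNet-18 is just an example choice).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained feature extractor so only the new head learns.
for p in model.parameters():
    p.requires_grad = False

# Replace the final layer with a fresh head sized for our task.
num_classes = 2                      # e.g. tumour vs. healthy tissue
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Train only the new head's parameters with the usual safe defaults.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
```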