Neural Networks — stacked layers of weighted sums and non-linearities
A radiologist needs software that can spot tumours in X-rays. The features that distinguish tumour from healthy tissue are complex, non-linear, and interact in ways no human can hand-engineer. Traditional ML needs those features handed to it; neural networks learn the features themselves.
Each neuron computes a weighted sum of its inputs, adds a bias, then applies a non-linear activation. Stack several layers: the first layer detects simple patterns, the next combines them into complex ones. The network learns by backpropagating the error gradient from output to input, nudging weights in the direction that reduces error.
z = w·x + b → a = σ(z)
σ is the activation function. Common choices: ReLU(z) = max(0, z), sigmoid σ(z) = 1/(1+e⁻ᶻ), tanh(z).
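As a concrete illustration, here is one layer of that computation in NumPy; the input, weight, and bias values are made up for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # ReLU: max(0, z), element-wise

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes z into (0, 1)

# One layer: weighted sum plus bias, then a non-linearity.
# Shapes: x is (n_inputs,), W is (n_neurons, n_inputs), b is (n_neurons,).
x = np.array([0.5, -1.2])              # illustrative input values
W = np.array([[0.1, 0.4],
              [-0.3, 0.8],
              [0.7, -0.2]])            # illustrative weights: 3 neurons, 2 inputs
b = np.array([0.0, 0.1, -0.1])         # illustrative biases

z = W @ x + b                          # z = w·x + b for every neuron at once
a = relu(z)                            # a = σ(z) with σ = ReLU
print(z, a)
```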
∂L/∂wₗ = ∂L/∂aₗ · ∂aₗ/∂zₗ · ∂zₗ/∂wₗ
Gradient flows backward; each layer gets its fair share of blame for the error.
w ← w − η · (1/m) Σ ∂L/∂w
η = learning rate (typically 0.001–0.01). Adam adapts η per parameter.
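Putting the two formulas together, here is a minimal NumPy sketch of one backward pass and one plain gradient-descent step for a single sigmoid neuron with a squared-error loss; all the values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single sigmoid neuron, squared-error loss L = 0.5 * (a - y)^2.
x = np.array([0.5, -1.2])       # illustrative inputs
y = 1.0                         # illustrative target
w = np.array([0.1, -0.3])       # illustrative weights
b = 0.0                         # bias
eta = 0.01                      # learning rate η

# Forward pass
z = w @ x + b
a = sigmoid(z)

# Backward pass: chain rule ∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w
dL_da = a - y                   # ∂L/∂a for squared error
da_dz = a * (1.0 - a)           # derivative of the sigmoid at z
dL_dw = dL_da * da_dz * x       # ∂z/∂w = x
dL_db = dL_da * da_dz           # ∂z/∂b = 1

# Gradient-descent update: w ← w − η · ∂L/∂w
w = w - eta * dL_dw
b = b - eta * dL_db
print(w, b)
```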
Drag the sliders to set two input values. Watch activations propagate through a 3-layer network (2→3→3→1). Brighter nodes = higher activation. The output neuron applies a sigmoid; drag the inputs to see it tip toward 0 or 1.
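The demo's actual weights and hidden activations aren't specified; the sketch below only mirrors the 2→3→3→1 shape, assuming ReLU hidden layers, random placeholder weights, and the sigmoid output.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# 2 inputs → 3 hidden → 3 hidden → 1 output; weights are random stand-ins.
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)
W3, b3 = rng.normal(size=(1, 3)), np.zeros(1)

def forward(x1, x2):
    a1 = relu(W1 @ np.array([x1, x2]) + b1)   # hidden layer 1 activations
    a2 = relu(W2 @ a1 + b2)                   # hidden layer 2 activations
    return sigmoid(W3 @ a2 + b3)[0]           # sigmoid output in (0, 1)

print(forward(0.2, -0.7))   # nudge the inputs and watch the output tip toward 0 or 1
```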
Start with ReLU activations and Adam optimiser (lr=0.001). This combination works well across a wide range of architectures.
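A minimal PyTorch sketch of that default setup; the architecture, data, and loss here are placeholders chosen only to make the example run.

```python
import torch
import torch.nn as nn

# Placeholder architecture: the point is the ReLU + Adam (lr=0.001) default.
model = nn.Sequential(
    nn.Linear(2, 3), nn.ReLU(),
    nn.Linear(3, 3), nn.ReLU(),
    nn.Linear(3, 1), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss()                       # binary target, sigmoid output

# One illustrative training step on made-up data
x = torch.randn(8, 2)                        # batch of 8 two-feature inputs
y = torch.randint(0, 2, (8, 1)).float()      # made-up 0/1 labels

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                              # backprop computes every gradient
optimizer.step()                             # Adam adapts the step per parameter
```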
The magic of deep learning is that gradients chain together through every layer. Automatic differentiation takes care of the chaining: you define the forward pass, and the framework computes all the gradients for you.
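A tiny PyTorch autograd sketch of that idea: only the forward pass is written out, and backward() fills in every gradient. The values are illustrative.

```python
import torch

# Define only the forward pass; autograd records it and differentiates it.
w = torch.tensor([0.1, -0.3], requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
x = torch.tensor([0.5, -1.2])    # illustrative input
y = torch.tensor(1.0)            # illustrative target

a = torch.sigmoid(w @ x + b)     # forward: z = w·x + b, a = σ(z)
loss = 0.5 * (a - y) ** 2        # squared-error loss

loss.backward()                  # chain rule applied automatically, output → input
print(w.grad, b.grad)            # ∂L/∂w and ∂L/∂b, no manual derivatives needed
```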
Don't train from scratch if a pre-trained model exists. Fine-tune the last layers on your data: 10× faster, far less data required.
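A common PyTorch pattern for that workflow, sketched with torchvision's ResNet-18 standing in for whichever pre-trained model fits the task; num_classes is a placeholder.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 2                                # e.g. tumour vs. no tumour (placeholder)

model = models.resnet18(weights="DEFAULT")     # pre-trained backbone as a stand-in

# Freeze everything learned on the original dataset...
for param in model.parameters():
    param.requires_grad = False

# ...then replace the final layer and train only that on your data.
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)
```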