Activation Function
An activation function is a non-linear function applied elementwise inside a Neural Network so that stacked layers don’t collapse to a single linear map.
Why does nonlinearity matter?
Without an activation, a 2-layer network is just $s = W_2 (W_1 x) = (W_2 W_1) x$, i.e. a single linear map $Wx$ with $W = W_2 W_1$, a linear classifier. The activation is what lets the network compose hierarchical, non-linear feature transforms.
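A quick numpy sanity check of that collapse (my own sketch, not from the original notes): two stacked linear layers equal one linear layer until a nonlinearity sits between them.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

# Two linear layers collapse: W2 (W1 x) == (W2 W1) x.
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# Insert a ReLU between the layers and the collapse no longer holds.
relu = lambda z: np.maximum(0.0, z)
print(W2 @ relu(W1 @ x))  # generally differs from (W2 @ W1) @ x
```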
Intuition
Each activation is a “gate” with some shape. A linear combination picks a direction in input space, and the activation decides what to do with signals pointing that way: sigmoid says “pass through if positive-ish, else suppress”, ReLU says “pass through unchanged or kill entirely.” Saturation (sigmoid/tanh at extreme inputs) is the enemy of learning because the derivative goes to zero, so gradients can’t flow back through that neuron.
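To make the saturation point concrete, here is a small sketch (mine, using the standard derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$): the sigmoid's gradient all but vanishes at large inputs, while ReLU's stays 1 for any positive input.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
d_sigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))

for x in [0.0, 2.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={d_sigmoid(x):.6f}  relu'={1.0 if x > 0 else 0.0}")
# At x=10, sigmoid' is ~4.5e-05: almost no gradient flows back
# through a saturated sigmoid unit, while ReLU still passes it all.
```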
Common activations (numpy sketches of the ReLU family follow the list):
- Sigmoid Function, $\sigma(x) = \frac{1}{1 + e^{-x}}$: see that page for drawbacks; no longer used in hidden layers
- Tanh, $\tanh(x)$: still kills gradients when saturated, but zero-centered (vs sigmoid)
- Rectified Linear Unit (ReLU), $f(x) = \max(0, x)$
- Does not saturate in the positive region
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice
- More biologically plausible than sigmoid
- Drawbacks: not zero-centered, and the hard zero on negative inputs can produce “dead ReLU” units whose gradient stays 0
- Leaky ReLU, $f(x) = \max(0.01x, x)$
- Parametric ReLU, $f(x) = \max(\alpha x, x)$ with learnable $\alpha$
- Exponential Linear Unit (ELU), $f(x) = x$ for $x > 0$ and $\alpha(e^x - 1)$ otherwise:
- All benefits of ReLU
- Closer to zero-mean outputs
- Negative saturation regime adds robustness to noise vs Leaky ReLU
- ELU/Leaky ReLU exist because “dead ReLU” is a real failure mode: once a neuron’s pre-activation is always negative, gradient is zero forever and that neuron never recovers. Letting a little negative signal through keeps the neuron alive
- Maxout Neuron, $\max(w_1^\top x + b_1, w_2^\top x + b_2)$: generalizes ReLU and Leaky ReLU, but doubles the parameters per neuron
- Softmax Function (used on the output layer for classification, not as a hidden-layer activation)
In general, just use ReLU and be careful with learning rates.
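For reference, numpy sketches of the ReLU-family activations above (my own transcription of the standard formulas, not code from CS231n):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):   # small fixed negative slope
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):             # same form, but alpha is learned
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):           # negatives saturate smoothly toward -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):   # elementwise max of two linear maps
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0. 0. 0. 0.5 2.]
print(leaky_relu(x))  # negative inputs leak through at 1%
print(elu(x))         # negatives bend toward -1 instead of clipping to 0
```

The `np.where` form makes the design choice visible: everything past plain ReLU only changes what happens on the negative branch, which is exactly the “keep the neuron alive” point above.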

- From CS231n (?)
Parametric ReLU makes $\alpha$ a learnable parameter: https://datascience.stackexchange.com/questions/18583/leakyrelu-vs-prelu
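A minimal PyTorch sketch of that learnable $\alpha$ (`nn.PReLU` is the real torch module; the numbers here are just my illustration):

```python
import torch
import torch.nn as nn

prelu = nn.PReLU(init=0.25)  # alpha starts at 0.25 and is trained
x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
y = prelu(x).sum()
y.backward()

print(prelu.weight)       # the learnable alpha parameter
print(prelu.weight.grad)  # nonzero: gradient flows into alpha itself
```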
There’s also Mish.
Modern defaults
From the CS231n Lec 4 slides:
- ReLU: safe default for FC/CNN layers, cheap, doesn’t saturate on positives
- GELU, $\text{GELU}(x) = x \, \Phi(x)$ (where $\Phi$ is the standard normal CDF): default in Transformers (BERT, GPT), smooth, slightly negative-permitting near the origin
- SiLU / Swish, $f(x) = x \cdot \sigma(x)$: used in newer vision models and Llama-style LLMs, smooth like GELU but cheaper (both sketched in code below)
CS231n’s blunt rule of thumb: “ReLU is a good default choice for most problems.”
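A small numeric comparison of the three defaults (formulas as stated above; this snippet is mine, not from the slides):

```python
import numpy as np
from math import erf

def gelu(x):  # x * Phi(x), Phi = standard normal CDF
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def silu(x):  # x * sigmoid(x), a.k.a. Swish
    return x / (1.0 + np.exp(-x))

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"x={x:5.1f}  relu={max(0.0, x):6.3f}  "
          f"gelu={gelu(x):6.3f}  silu={silu(x):6.3f}")
# Both GELU and SiLU pass a little negative signal near the origin,
# unlike ReLU's hard zero; for large |x| all three nearly agree.
```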