Activation Function

An activation function is a non-linear function applied elementwise inside a Neural Network so that stacked layers don’t collapse to a single linear map.

Why does nonlinearity matter?

Without an activation, a 2-layer network is just $f(x) = W_2 (W_1 x) = (W_2 W_1) x = Wx$ with $W = W_2 W_1$, a linear classifier. The activation is what lets the network compose hierarchical, non-linear feature transforms.
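A quick way to see the collapse numerically (shapes and weights here are arbitrary, purely for illustration):

```python
import numpy as np

# Two stacked linear layers with no activation in between collapse to a
# single linear map.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))
W2 = rng.standard_normal((2, 5))
x = rng.standard_normal(3)

two_layer = W2 @ (W1 @ x)    # "deep" network without nonlinearity
collapsed = (W2 @ W1) @ x    # the equivalent single layer W = W2 @ W1

assert np.allclose(two_layer, collapsed)
```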

Intuition

Each activation is a “gate” with some shape. A linear combination picks a direction in input space, and the activation decides what to do with signals pointing that way: sigmoid says “pass through if positive-ish, else suppress”, ReLU says “pass through unchanged or kill entirely.” Saturation (sigmoid/tanh at extreme inputs) is the enemy of learning because the derivative goes to zero, so gradients can’t flow back through that neuron.
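A small numerical check of that saturation claim (hand-rolled derivatives, framework-free):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25, vanishes as |x| grows

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # passes gradient unchanged on positives

for x in [0.0, 2.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  relu'={relu_grad(x):.0f}")
# At x=10 the sigmoid gradient is ~4.5e-05: a saturated sigmoid neuron
# passes almost nothing back, while ReLU's gradient is still 1.
```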

Common activations (the ReLU-family variants are sketched in code after the list):

  • Sigmoid Function, $\sigma(x) = 1/(1 + e^{-x})$: see the page for drawbacks; no longer used
  • Tanh, $\tanh(x)$: still kills gradients when saturated, but zero-centered (vs sigmoid)
  • Rectified Linear Unit (ReLU), $f(x) = \max(0, x)$
    • Does not saturate in the positive region
    • Very computationally efficient
    • Converges much faster than sigmoid/tanh in practice
    • More biologically plausible than sigmoid
    • Drawbacks: not zero-centered; and “dead ReLU” units, whose pre-activation is always negative, get zero gradient and stop learning
  • Leaky ReLU, $f(x) = \max(0.01x, x)$
  • Parametric ReLU, $f(x) = \max(\alpha x, x)$ with learnable $\alpha$
  • Exponential Linear Unit (ELU), $f(x) = x$ if $x > 0$, else $\alpha(e^x - 1)$:
    • All benefits of ReLU
    • Closer to zero-mean outputs
    • Negative saturation regime adds robustness to noise vs Leaky ReLU
    • ELU/Leaky ReLU exist because “dead ReLU” is a real failure mode: once a neuron’s pre-activation is always negative, gradient is zero forever and that neuron never recovers. Letting a little negative signal through keeps the neuron alive
  • Maxout Neuron
  • Softmax Function
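
A minimal NumPy sketch of the ReLU-family definitions above, using the conventional defaults ($\alpha = 0.01$ for Leaky ReLU, $\alpha = 1$ for ELU):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)   # small negative slope keeps units alive

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3.0, 3.0, 7)
print(relu(x))        # negatives clipped to 0
print(leaky_relu(x))  # negatives scaled by 0.01
print(elu(x))         # negatives saturate smoothly toward -alpha
```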

In general, just use ReLU and be careful with learning rates.

  • From CS231n (?)

Parametric ReLU makes $\alpha$ a learnable parameter: https://datascience.stackexchange.com/questions/18583/leakyrelu-vs-prelu
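PyTorch ships this as nn.PReLU, which registers $\alpha$ as a parameter so the optimizer updates it alongside the weights; a quick sketch:

```python
import torch
import torch.nn as nn

# nn.PReLU registers the negative slope alpha as a learnable Parameter
# (default init 0.25), so it is updated like any other weight.
prelu = nn.PReLU(num_parameters=1, init=0.25)

x = torch.linspace(-2.0, 2.0, 5)
print(prelu(x))                  # negative inputs scaled by alpha
print(list(prelu.parameters()))  # the single learnable alpha
```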

There’s also Mish, $f(x) = x \tanh(\ln(1 + e^x))$.

Modern defaults

From the CS231n Lec 4 slides:

  • ReLU: safe default for FC/CNN layers, cheap, doesn’t saturate on positives
  • GELU, $\mathrm{GELU}(x) = x\,\Phi(x)$ (where $\Phi$ is the standard normal CDF): default in Transformers (BERT, GPT); smooth, and lets small negative values through near the origin
  • SiLU / Swish, $f(x) = x \cdot \sigma(x)$: used in newer vision models and Llama-style LLMs; smooth like GELU but cheaper to compute (both are sketched below)
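
A NumPy sketch of both; the tanh formula below is the widely used approximation of $x\,\Phi(x)$ (the exact version uses erf), with the constants from the GELU paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu_tanh(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    # SiLU / Swish: one sigmoid per element, cheaper than GELU's tanh/erf
    return x * sigmoid(x)

x = np.linspace(-3.0, 3.0, 7)
print(gelu_tanh(x))
print(silu(x))
```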

CS231n’s blunt rule of thumb: “ReLU is a good default choice for most problems.”