Softmax Function
The softmax function turns a vector of raw class scores into a probability distribution: $\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$. Each raw, unnormalized score $x_i$ is called the “logit”.
In code, we have simply (a minimal sketch, assuming a 1-D NumPy array of logits):
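```python
import numpy as np

def softmax(x):
    """Naive softmax: exponentiate each logit, then normalize by the sum."""
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x)
```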
We use the Softmax Function to compute the Cross-Entropy Loss.
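Concretely, for a single example with logit vector $x$ and correct class $y$, the loss is the negative log of the softmax probability assigned to the correct class: $L = -\log\frac{e^{x_y}}{\sum_j e^{x_j}}$.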
Why don't we just divide by the sum?
Simply dividing each score by the sum of all scores does produce values that sum to one. However, it doesn’t handle negative scores well (an entry can come out negative, and the sum itself can be zero or negative), and plain division lacks exponentiation’s ability to amplify differences between scores.
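As an illustration (the scores are made up): with scores $[3, -1, 2]$, dividing by the sum $4$ gives $[0.75, -0.25, 0.5]$, which is not a valid probability distribution because one entry is negative. Exponentiating first gives $[e^{3}, e^{-1}, e^{2}] \approx [20.09, 0.37, 7.39]$, which normalizes to roughly $[0.72, 0.01, 0.27]$ and also widens the gap between the top score and the rest.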
- Multinomial Logistic Regression
- probability distribution
A loss of 0 is the theoretical minimum, but it is only reached in the limit: the score of the correct class would have to go towards positive infinity while the scores of the incorrect classes go towards negative infinity.
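For example, with (illustrative) scores $[s, 0, 0]$ and the first class correct, the loss is $-\log\frac{e^{s}}{e^{s} + 2}$: roughly $0.55$ at $s = 1$, roughly $9 \times 10^{-5}$ at $s = 10$, and exactly $0$ only as $s \to \infty$.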
Numeric Stability
https://stackoverflow.com/posts/49212689/timeline
The softmax function is prone to two issues:
- Overflow: occurs when very large numbers are approximated as infinity.
- Underflow: occurs when very small numbers (near zero on the number line) are rounded to zero.
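As a quick illustration with 64-bit floats, where $e^{x}$ overflows for $x$ greater than roughly 709:

```python
import numpy as np

print(np.exp(1000.0))   # inf -> overflow (NumPy also emits a RuntimeWarning)
print(np.exp(-1000.0))  # 0.0 -> underflow
```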
To combat these issues when computing softmax, a common trick is to shift the input vector by subtracting its maximum element from every element. For the input vector $x$, define $z$ such that:

$$z = x - \max_i x_i$$

Since adding a constant to every logit leaves the softmax output unchanged, this shift does not alter the result, but it guarantees that the largest exponent is $e^{0} = 1$, so overflow cannot occur.
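A sketch of the shifted computation in NumPy (the name `stable_softmax` is illustrative), compared against the naive version on deliberately large logits:

```python
import numpy as np

def stable_softmax(x):
    """Softmax with the max-subtraction trick for numerical stability."""
    z = x - np.max(x)          # the largest entry of z is 0, so exp(z) <= 1
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

x = np.array([1000.0, 1001.0, 1002.0])
print(np.exp(x) / np.sum(np.exp(x)))  # naive: [nan nan nan] after overflow
print(stable_softmax(x))              # [0.09003057 0.24472847 0.66524096]
```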