A function which maps an arbitrary vector of reals onto the probability simplex. The simplex in question is

$$\Delta^{K-1} = \Bigl\{ p \in \mathbb{R}^{K} : p_i \ge 0,\ \sum_{i=1}^{K} p_i = 1 \Bigr\}.$$

This set describes all possible probability distributions over $K$ categories.
Ubiquitous in modern classification tasks, particularly in neural networks.
Why? Well for one, it turns the slightly fiddly problem of estimating a constrained quantity into an unconstrained one, in a computationally expedient way. It’s not the only such option, but it is simple and has lots of nice mathematical symmetries. It is kinda-sorta convex in its arguments. It falls out in variational inference via KL, etc.
1 Basic
The softmax function transforms a vector of real numbers into a probability distribution over predicted output classes for classification tasks. Given a vector $z = (z_1, \dots, z_K) \in \mathbb{R}^K$, it is defined componentwise as

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K.$$
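As a concrete sketch, here is a numerically stable numpy implementation. The max-shift is standard practice rather than part of the definition, and `softmax` is my own helper name:

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis.

    Subtracting max(z) changes nothing mathematically (softmax is invariant
    to adding a constant to every logit) but prevents overflow in exp.
    """
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

p = softmax([1.0, 2.0, 3.0])
print(p, p.sum())  # -> [0.09003057 0.24472847 0.66524096] 1.0
```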
2 Derivatives
The first derivative with respect to the inputs is

$$\frac{\partial \sigma(z)_i}{\partial z_j} = \sigma(z)_i \left( \delta_{ij} - \sigma(z)_j \right),$$

where $\delta_{ij}$ is the Kronecker delta.
The second derivative is then

$$\frac{\partial^2 \sigma(z)_i}{\partial z_j \, \partial z_k} = \sigma(z)_i \left( \delta_{ij} - \sigma(z)_j \right) \left( \delta_{ik} - \sigma(z)_k \right) - \sigma(z)_i \, \sigma(z)_j \left( \delta_{jk} - \sigma(z)_k \right).$$
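A quick sanity check of the Jacobian formula against central finite differences (a sketch; the function names are mine):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = dσ_i/dz_j = σ_i (δ_ij − σ_j) = diag(σ) − σ σᵀ."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

z = np.array([0.5, -1.0, 2.0])
eps = 1e-6
# Central differences: one column of the Jacobian per perturbed input.
J_fd = np.stack(
    [(softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps) for e in np.eye(3)],
    axis=1,
)
assert np.allclose(softmax_jacobian(z), J_fd, atol=1e-8)
```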
3 Non-exponential
Suppose we do not use the exponential but some other positive function $g : \mathbb{R} \to \mathbb{R}_{+}$, giving the generalized softmax

$$\sigma_g(z)_i = \frac{g(z_i)}{\sum_{j=1}^{K} g(z_j)}.$$

The first derivative then becomes

$$\frac{\partial \sigma_g(z)_i}{\partial z_j} = \frac{g'(z_j)}{\sum_{k} g(z_k)} \left( \delta_{ij} - \sigma_g(z)_i \right),$$

which reduces to the usual Jacobian when $g = \exp$.
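A sketch of the generalized form; softplus is just one illustrative choice of positive $g$, not anything canonical:

```python
import numpy as np

def generalized_softmax(z, g):
    """p_i = g(z_i) / Σ_j g(z_j), for any positive-valued g."""
    gz = g(np.asarray(z, dtype=float))
    return gz / gz.sum()

softplus = lambda z: np.log1p(np.exp(z))  # smooth, positive everywhere

p = generalized_softmax([0.5, -1.0, 2.0], softplus)
print(p, p.sum())  # a valid distribution, less peaked than exp would give
```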
4 log-Taylor softmax
TBD
5 Via Gumbel
Categorical sampling can be smoothly approximated via the Gumbel-softmax (a.k.a. concrete) trick: perturb the logits with i.i.d. standard Gumbel noise and pass them through a temperature-scaled softmax. This gives differentiable, approximately one-hot samples, which is useful for training neural networks with discrete outputs by backpropagation.
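A sampling sketch under the usual construction (Gumbel noise via $-\log(-\log U)$); the temperature default here is arbitrary:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def gumbel_softmax_sample(logits, temperature=0.5, rng=None):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature).

    As temperature → 0 samples approach hard one-hot draws from
    Categorical(softmax(logits)); higher temperatures give smoother samples.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=np.shape(logits))
    gumbel = -np.log(-np.log(u))  # standard Gumbel(0, 1) noise
    return softmax((np.asarray(logits) + gumbel) / temperature)

print(gumbel_softmax_sample([1.0, 2.0, 3.0], temperature=0.1))  # nearly one-hot
```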
6 Entropy
6.1 Softmax
We consider the entropy of the resulting distribution $p = \sigma(z)$,

$$H(p) = -\sum_{i=1}^{K} p_i \log p_i.$$

The entropy terms expand via $\log p_i = z_i - \log Z$, where $Z = \sum_{j=1}^{K} e^{z_j}$ is the normalizing constant. Thus, the entropy of the softmax distribution simplifies to

$$H = \log Z - \sum_{i=1}^{K} p_i z_i.$$
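The simplification is easy to check numerically (a throwaway sketch):

```python
import numpy as np

z = np.array([0.5, -1.0, 2.0])
p = np.exp(z - z.max())
p /= p.sum()                                         # softmax
H_direct = -(p * np.log(p)).sum()                    # −Σ p_i log p_i
logZ = z.max() + np.log(np.exp(z - z.max()).sum())   # stable log Σ e^{z_j}
H_simplified = logZ - (p * z).sum()                  # log Z − Σ p_i z_i
assert np.isclose(H_direct, H_simplified)
```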
If we are using softmax we probably care about derivatives, so let us compute the gradient of the entropy with respect to the logits $z$. Using $\partial p_i / \partial z_k = p_i (\delta_{ik} - p_k)$ and $\partial \log Z / \partial z_k = p_k$,

$$\frac{\partial H}{\partial z_k}
= p_k - \sum_{i} \frac{\partial p_i}{\partial z_k} z_i - p_k
= -p_k \Bigl( z_k - \sum_{i} p_i z_i \Bigr).$$

Thus, the gradient vector is

$$\nabla_z H = -p \odot \left( z - \bar{z}\,\mathbf{1} \right), \qquad \bar{z} := \sum_i p_i z_i.$$

For compactness, we define $s := \log p + H \mathbf{1}$; since $\log p_k = z_k - \log Z$ and $H = \log Z - \bar{z}$, we have $z_k - \bar{z} = \log p_k + H$, so equivalently

$$\nabla_z H = -p \odot s = -p \odot \left( \log p + H\,\mathbf{1} \right),$$

a form that will recur in the non-exponential case below.
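And the closed-form gradient agrees with finite differences (a sketch; the names are mine):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def entropy(z):
    p = softmax(z)
    return -(p * np.log(p)).sum()

def entropy_grad(z):
    """∇_z H = −p ⊙ (log p + H), the closed form derived above."""
    p = softmax(z)
    H = -(p * np.log(p)).sum()
    return -p * (np.log(p) + H)

z = np.array([0.5, -1.0, 2.0])
eps = 1e-6
g_fd = np.array([(entropy(z + eps * e) - entropy(z - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
assert np.allclose(entropy_grad(z), g_fd, atol=1e-8)
```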
6.2 Non-exponential
Let’s extend the reasoning to category probabilities given by the generalized softmax function.
The entropy becomes

$$H = -\sum_i p_i \log p_i = \log G - \frac{1}{G} \sum_i g(z_i) \log g(z_i), \qquad G := \sum_j g(z_j), \quad p_i = \frac{g(z_i)}{G}.$$

To compute the gradient we use

$$\frac{\partial p_i}{\partial z_k} = \frac{g'(z_k)}{G} \left( \delta_{ik} - p_i \right),$$

and note that $\sum_i \partial p_i / \partial z_k = 0$, so the constant term in $\partial H / \partial z_k = -\sum_i (\partial p_i / \partial z_k)(\log p_i + 1)$ drops out. Then, the gradient is

$$\frac{\partial H}{\partial z_k} = -\frac{g'(z_k)}{G} \left( \log p_k + H \right),$$

which recovers the softmax case when $g = \exp$, since then $g'(z_k)/G = p_k$.
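The same finite-difference check passes for the generalized form; softplus (whose derivative is the logistic sigmoid) stands in for $g$ here:

```python
import numpy as np

softplus = lambda z: np.log1p(np.exp(z))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # d/dz softplus(z)

def gen_entropy(z):
    p = softplus(z) / softplus(z).sum()
    return -(p * np.log(p)).sum()

def gen_entropy_grad(z):
    """∂H/∂z_k = −(g'(z_k)/G)(log p_k + H), as derived above."""
    G = softplus(z).sum()
    p = softplus(z) / G
    H = -(p * np.log(p)).sum()
    return -(sigmoid(z) / G) * (np.log(p) + H)

z = np.array([0.5, -1.0, 2.0])
eps = 1e-6
g_fd = np.array([(gen_entropy(z + eps * e) - gen_entropy(z - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
assert np.allclose(gen_entropy_grad(z), g_fd, atol=1e-8)
```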