The softmax function

September 12, 2024 — September 15, 2024

classification
metrics
probability
regression
statistics

A function which maps an arbitrary $\mathbb{R}^d$-vector to the weights of a categorical distribution (i.e. the $(d-1)$-simplex).

The $(K-1)$-simplex is defined as the set of $K$-dimensional vectors whose elements are non-negative and sum to one. Specifically,

$$\Delta^{K-1} = \Bigl\{\, p \in \mathbb{R}^K : p_i \ge 0 \ \text{for all}\ i, \ \text{and}\ \sum_{i=1}^K p_i = 1 \,\Bigr\}$$

This set describes all possible probability distributions over $K$ outcomes, which aligns with the purpose of the softmax function in generating probabilities from “logits” (un-normalised log-probabilities) in classification problems.

Ubiquitous in modern classification tasks, particularly in neural networks.

Why? Well for one, it turns the slightly fiddly problem of estimating a constrained quantity into an unconstrained one, in a computationally expedient way. It’s not the only such option, but it is simple and has lots of nice mathematical symmetries. It is kinda-sorta convex in its arguments. It falls out in variational inference via KL, etc.

1 Basic

The softmax function transforms a vector of real numbers into a probability distribution over predicted output classes for classification tasks. Given a vector $z = (z_1, z_2, \dots, z_K)$, the softmax function $\sigma(z)_i$ for the $i$-th component is

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}.$$
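
To make this concrete, here is a minimal NumPy sketch (my addition, not part of the original notes). Subtracting $\max_k z_k$ before exponentiating leaves the result unchanged, since the softmax is invariant to adding a constant to every logit, but avoids overflow.

```python
import numpy as np

def softmax(z):
    """Map a real vector z to a point on the probability simplex."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # shift-invariance: same result, no overflow
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])
print(p, p.sum())             # non-negative weights that sum to 1
```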

2 Derivatives

The first derivative with respect to $z_j$ is
$$\frac{\partial \sigma_i}{\partial z_j} = \sigma_i\,(\delta_{ij} - \sigma_j),$$

where $\delta_{ij}$ is the Kronecker delta and we abbreviate $\sigma_i \equiv \sigma(z)_i$.

The second derivative is then
$$\frac{\partial^2 \sigma_i}{\partial z_j\,\partial z_k} = \sigma_i(\delta_{ik} - \sigma_k)(\delta_{ij} - \sigma_j) - \sigma_i\sigma_j(\delta_{jk} - \sigma_k),$$
i.e.

  • $i = j = k$: $\sigma_i(1 - \sigma_i)(1 - 2\sigma_i)$
  • $i = j \neq k$ (and, by symmetry in $j$ and $k$, $i = k \neq j$): $\sigma_i\sigma_k(2\sigma_i - 1)$
  • $j = k \neq i$: $\sigma_i\sigma_k(2\sigma_k - 1)$
  • $i$, $j$, $k$ all distinct: $2\sigma_i\sigma_j\sigma_k$
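
A quick sanity check of the first derivative: the Jacobian of the softmax is $\operatorname{diag}(\sigma) - \sigma\sigma^{\mathsf T}$, which we can compare against central finite differences. This sketch is my addition.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    # d sigma_i / d z_j = sigma_i * (delta_ij - sigma_j)
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

z = np.array([0.5, -1.0, 2.0, 0.3])
J = softmax_jacobian(z)

eps = 1e-6
J_fd = np.zeros_like(J)
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    J_fd[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(J - J_fd)))   # small (~1e-10): analytic Jacobian checks out
```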

3 Non-exponential

Suppose we do not use the exp map, but generalize the softmax to use some other invertible, differentiable, increasing function $\phi: \mathbb{R} \to \mathbb{R}_+$. Given a vector $z = (z_1, z_2, \dots, z_K)$, the generalized softmax function $\Phi_\phi(z)$ for the $i$-th component is defined as

$$\Phi_\phi(z)_i = \frac{\phi(z_i)}{\sum_{k=1}^K \phi(z_k)}.$$
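
As an illustrative sketch (my addition), the same normalization with a non-exponential $\phi$; softplus is just an example of an increasing, differentiable map $\mathbb{R} \to \mathbb{R}_+$, not a choice made in the text.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)): increasing, differentiable, strictly positive
    return np.logaddexp(0.0, x)

def generalized_softmax(z, phi=softplus):
    """Phi_phi(z)_i = phi(z_i) / sum_k phi(z_k) for a positive, increasing phi."""
    w = phi(np.asarray(z, dtype=float))
    return w / w.sum()

z = [2.0, 1.0, 0.1]
print(generalized_softmax(z))          # softplus-normalised weights
print(generalized_softmax(z, np.exp))  # phi = exp recovers the ordinary softmax
```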

4 log-Taylor softmax

TBD

5 Via Gumbel

Sampling from the categorical distribution defined by the softmax can be relaxed using the Gumbel-softmax trick, which is useful for training neural networks with discrete outputs.
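
A minimal sketch of the trick (my addition; the temperature parameter `tau` is an assumption for illustration): perturb the logits with Gumbel(0, 1) noise and push them through a temperature-scaled softmax. As `tau` approaches 0, the samples approach one-hot draws from the underlying categorical distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def gumbel_softmax_sample(logits, tau=0.5):
    """Continuous relaxation of a categorical sample; tau > 0 controls sharpness."""
    g = rng.gumbel(size=len(logits))              # Gumbel(0, 1) noise
    return softmax((np.asarray(logits) + g) / tau)

logits = np.log([0.2, 0.3, 0.5])
print(gumbel_softmax_sample(logits, tau=1.0))     # soft sample
print(gumbel_softmax_sample(logits, tau=0.05))    # nearly one-hot
```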

6 Entropy

6.1 Softmax

We consider the entropy $H(p)$ of a categorical distribution with probabilities $p = [p_1, p_2, \dots, p_K]^{\mathsf T}$, where the probabilities are given by the softmax function, $p_k = \sigma_k(z) = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} = \frac{e^{z_k}}{Z}$, with $Z = \sum_{j=1}^K e^{z_j}$.

The entropy $H(p)$ is by definition $H(p) = -\sum_{k=1}^K p_k \log p_k$. Substituting $p_k = e^{z_k}/Z$ (so that $\log p_k = z_k - \log Z$) into the entropy expression, we obtain
$$H(p) = -\sum_{k=1}^K p_k \log p_k = -\sum_{k=1}^K p_k z_k + \sum_{k=1}^K p_k \log Z = -\sum_{k=1}^K p_k z_k + \log Z.$$

Thus, the entropy of the softmax distribution simplifies to
$$H(\sigma(z)) = \log Z - \sum_{k=1}^K p_k z_k.$$
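
The identity $H(\sigma(z)) = \log Z - \sum_k p_k z_k$ is easy to confirm numerically; this check is my addition.

```python
import numpy as np

z = np.array([1.5, -0.3, 0.0, 2.2])
e = np.exp(z - z.max())
p = e / e.sum()

H_direct = -np.sum(p * np.log(p))
log_Z = z.max() + np.log(np.sum(np.exp(z - z.max())))   # stable log-sum-exp
H_identity = log_Z - np.sum(p * z)

print(H_direct, H_identity)   # the two expressions agree
```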

If we are using softmax we probably care about derivatives, so let us compute the gradient of the entropy with respect to $z_i$. Writing $\bar z = \sum_k p_k z_k$ for the mean logit under $p$,
$$\frac{\partial H}{\partial z_i} = \frac{\partial}{\partial z_i}\Bigl(\log Z - \sum_{k=1}^K p_k z_k\Bigr) = \frac{1}{Z}\frac{\partial Z}{\partial z_i} - \sum_{k=1}^K\Bigl(\frac{\partial p_k}{\partial z_i} z_k + p_k \delta_{ik}\Bigr) = p_i - \Bigl(\sum_{k=1}^K p_k(\delta_{ik} - p_i) z_k + p_i\Bigr) = p_i - \bigl(p_i z_i - p_i \bar z + p_i\bigr) = -p_i\,(z_i - \bar z),$$
where we used $\frac{\partial Z}{\partial z_i} = e^{z_i} = Z p_i$ and $\frac{\partial p_k}{\partial z_i} = p_k(\delta_{ik} - p_i)$.

Thus, the gradient vector is $\nabla_z H = -p \odot (z - \bar z\mathbf{1})$; note that its components sum to zero, as they must, since adding a constant to every logit leaves the softmax (and hence the entropy) unchanged. Thence the Hessian matrix $\nabla^2 H$: writing $c_i = z_i - \bar z$ and using $\frac{\partial \bar z}{\partial z_j} = p_j(1 + c_j)$,
$$\frac{\partial^2 H}{\partial z_i \partial z_j} = -\frac{\partial}{\partial z_j}\bigl(p_i c_i\bigr) = -p_i(1 + c_i)\,\delta_{ij} + p_i p_j\bigl(1 + c_i + c_j\bigr),$$
i.e. $\nabla^2 H = -\operatorname{diag}\bigl(p \odot (1 + c)\bigr) + p p^{\mathsf T} + (p \odot c)\,p^{\mathsf T} + p\,(p \odot c)^{\mathsf T}$.

For compactness, we define $p = \sigma(z)$ and $c = z - \bar z\mathbf{1}$. Using the Taylor expansion, we approximate the entropy after a small change $\Delta z$:
$$H(z + \Delta z) \approx H(z) + (\nabla_z H)^{\mathsf T}\Delta z + \tfrac{1}{2}\Delta z^{\mathsf T}(\nabla^2 H)\Delta z = H(p) - \sum_{i=1}^K p_i c_i \Delta z_i - \tfrac{1}{2}\sum_{i=1}^K p_i (1 + c_i)\Delta z_i^2 + \tfrac{1}{2}\Bigl(\sum_{i=1}^K p_i \Delta z_i\Bigr)^2 + \Bigl(\sum_{i=1}^K p_i c_i \Delta z_i\Bigr)\Bigl(\sum_{j=1}^K p_j \Delta z_j\Bigr).$$
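
Because this algebra is easy to get wrong, here is a finite-difference sanity check (my addition) of the gradient, the Hessian, and the resulting second-order approximation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def entropy(z):
    p = softmax(z)
    return -np.sum(p * np.log(p))

def grad_entropy(z):
    p = softmax(z)
    c = z - p @ z                        # centred logits c_i = z_i - zbar
    return -p * c

def hess_entropy(z):
    p = softmax(z)
    c = z - p @ z
    return (-np.diag(p * (1 + c)) + np.outer(p, p)
            + np.outer(p * c, p) + np.outer(p, p * c))

z = np.array([0.4, -1.2, 0.9, 0.0])
I = np.eye(len(z))
eps = 1e-5
g_fd = np.array([(entropy(z + eps * I[i]) - entropy(z - eps * I[i])) / (2 * eps)
                 for i in range(len(z))])
print(np.max(np.abs(grad_entropy(z) - g_fd)))    # gradient matches finite differences

dz = 0.01 * np.array([1.0, -2.0, 0.5, 0.3])
taylor = entropy(z) + grad_entropy(z) @ dz + 0.5 * dz @ hess_entropy(z) @ dz
print(entropy(z + dz), taylor)                   # agree to third order in dz
```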

6.2 Non-exponential

Let’s extend the reasoning to category probabilities given by the generalized softmax function,
$$p_k = \Phi_k(z) = \frac{\phi(z_k)}{\sum_{j=1}^K \phi(z_j)} = \frac{\phi(z_k)}{Z},$$
where $\phi: \mathbb{R} \to \mathbb{R}_+$ is an increasing, differentiable function, and $Z = \sum_{j=1}^K \phi(z_j)$.

The entropy becomes
$$H(p) = -\sum_{k=1}^K p_k \log p_k = -\sum_{k=1}^K p_k\bigl(\log \phi(z_k) - \log Z\bigr) = -\sum_{k=1}^K p_k \log \phi(z_k) + \log Z.$$

To compute the gradient $\nabla_z H$, we note that
$$\frac{\partial p_k}{\partial z_i} = p_k\Bigl(s_k \delta_{ik} - \sum_{j=1}^K p_j s_j \delta_{ij}\Bigr) = p_k s_k \delta_{ik} - p_k p_i s_i, \qquad \text{where } s_i = \frac{\phi'(z_i)}{\phi(z_i)}.$$

Then, the gradient is
$$\frac{\partial H}{\partial z_i} = -\sum_{k=1}^K\Bigl(\frac{\partial p_k}{\partial z_i}\log \phi(z_k) + p_k\frac{\phi'(z_k)}{\phi(z_k)}\delta_{ik}\Bigr) + \frac{1}{Z}\phi'(z_i) = -\sum_{k=1}^K\Bigl(\bigl(p_k s_k \delta_{ik} - p_k p_i s_i\bigr)\log \phi(z_k) + p_k s_k \delta_{ik}\Bigr) + \frac{1}{Z}\phi'(z_i).$$
Since $\frac{1}{Z}\phi'(z_i) = p_i s_i = \sum_{k=1}^K p_k s_k \delta_{ik}$, the last two terms cancel, leaving
$$\frac{\partial H}{\partial z_i} = -p_i s_i\Bigl(\log \phi(z_i) - \sum_{k=1}^K p_k \log \phi(z_k)\Bigr),$$
which recovers the plain-softmax result $\frac{\partial H}{\partial z_i} = -p_i(z_i - \bar z)$ when $\phi = \exp$ (so that $s_i = 1$ and $\log \phi(z_i) = z_i$).
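
As before, a quick numerical check (my addition, using softplus as an illustrative $\phi$) of the simplified gradient against central finite differences.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def gen_softmax(z):
    w = softplus(z)
    return w / w.sum()

def entropy(z):
    p = gen_softmax(z)
    return -np.sum(p * np.log(p))

def grad_entropy(z):
    p = gen_softmax(z)
    s = (1.0 / (1.0 + np.exp(-z))) / softplus(z)   # s_i = phi'(z_i) / phi(z_i)
    log_phi = np.log(softplus(z))
    return -p * s * (log_phi - np.sum(p * log_phi))

z = np.array([0.7, -0.4, 1.3, 0.0])
I = np.eye(len(z))
eps = 1e-5
g_fd = np.array([(entropy(z + eps * I[i]) - entropy(z - eps * I[i])) / (2 * eps)
                 for i in range(len(z))])
print(np.max(np.abs(grad_entropy(z) - g_fd)))   # small: analytic gradient checks out
```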
