The softmax function

September 12, 2024 — September 15, 2024

classification
metrics
probability
regression
statistics

A function which maps an arbitrary $\mathbb{R}^d$-vector to the weights of a categorical distribution (i.e. the $(d-1)$-simplex).

The $(K-1)$-simplex is defined as the set of $K$-dimensional vectors whose elements are non-negative and sum to one. Specifically,

$$\Delta^{K-1} = \Bigl\{\, p \in \mathbb{R}^K : p_i \ge 0 \ \text{for all}\ i, \ \text{and}\ \sum_{i=1}^K p_i = 1 \,\Bigr\}$$

This set describes all possible probability distributions over $K$ outcomes, which aligns with the purpose of the softmax function in generating probabilities from “logits” (un-normalised log-probabilities) in classification problems.

Ubiquitous in modern classification tasks, particularly in neural networks.

Why? Well for one, it turns the slightly fiddly problem of estimating a constrained quantity into an unconstrained one, in a computationally expedient way. It’s not the only such option, but it is simple and has lots of nice mathematical symmetries. It is kinda-sorta convex in its arguments. It falls out in variational inference via KL, etc.

1 Basic

The softmax function transforms a vector of real numbers into a probability distribution over predicted output classes for classification tasks. Given a vector $z = (z_1, z_2, \dots, z_K)$, the softmax function $\sigma(z)_i$ for the $i$-th component is

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}.$$
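
To make this concrete, here is a minimal NumPy sketch (my addition, not part of the original notes). Subtracting $\max_k z_k$ before exponentiating leaves the result unchanged, since the softmax is invariant to adding a constant to every logit, but avoids overflow.

```python
import numpy as np

def softmax(z):
    """Map a real vector z to a point on the probability simplex."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # shift-invariance: same result, no overflow
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])
print(p, p.sum())             # non-negative weights that sum to 1
```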

2 Derivatives

The first derivative with respect to $z_j$ is
$$\frac{\partial \sigma_i}{\partial z_j} = \sigma_i\,(\delta_{ij} - \sigma_j),$$

where $\delta_{ij}$ is the Kronecker delta and we abbreviate $\sigma_i \equiv \sigma(z)_i$.

The second derivative is then
$$\frac{\partial^2 \sigma_i}{\partial z_j\,\partial z_k} = \sigma_i(\delta_{ik} - \sigma_k)(\delta_{ij} - \sigma_j) - \sigma_i\sigma_j(\delta_{jk} - \sigma_k),$$
i.e.

  • $i = j = k$: $\sigma_i(1 - \sigma_i)(1 - 2\sigma_i)$
  • $i = j \neq k$ (and, by symmetry in $j$ and $k$, $i = k \neq j$): $\sigma_i\sigma_k(2\sigma_i - 1)$
  • $j = k \neq i$: $\sigma_i\sigma_k(2\sigma_k - 1)$
  • $i$, $j$, $k$ all distinct: $2\sigma_i\sigma_j\sigma_k$
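
A quick sanity check of the first derivative: the Jacobian of the softmax is $\operatorname{diag}(\sigma) - \sigma\sigma^{\mathsf T}$, which we can compare against central finite differences. This sketch is my addition.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    # d sigma_i / d z_j = sigma_i * (delta_ij - sigma_j)
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

z = np.array([0.5, -1.0, 2.0, 0.3])
J = softmax_jacobian(z)

eps = 1e-6
J_fd = np.zeros_like(J)
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    J_fd[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(J - J_fd)))   # small (~1e-10): analytic Jacobian checks out
```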

3 Non-exponential

Suppose we do not use the exp map, but generalize the softmax to use some other invertible, differentiable, increasing function $\phi: \mathbb{R} \to \mathbb{R}_+$. Given a vector $z = (z_1, z_2, \dots, z_K)$, the generalized softmax function $\Phi_\phi(z)$ for the $i$-th component is defined as

$$\Phi_\phi(z)_i = \frac{\phi(z_i)}{\sum_{k=1}^K \phi(z_k)}.$$
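
As an illustrative sketch (my addition), the same normalization with a non-exponential $\phi$; softplus is just an example of an increasing, differentiable map $\mathbb{R} \to \mathbb{R}_+$, not a choice made in the text.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)): increasing, differentiable, strictly positive
    return np.logaddexp(0.0, x)

def generalized_softmax(z, phi=softplus):
    """Phi_phi(z)_i = phi(z_i) / sum_k phi(z_k) for a positive, increasing phi."""
    w = phi(np.asarray(z, dtype=float))
    return w / w.sum()

z = [2.0, 1.0, 0.1]
print(generalized_softmax(z))          # softplus-normalised weights
print(generalized_softmax(z, np.exp))  # phi = exp recovers the ordinary softmax
```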

4 log-Taylor softmax

TBD

5 Via Gumbel

Sampling from the categorical distribution defined by the softmax can be relaxed using the Gumbel-softmax trick, which is useful for training neural networks with discrete outputs.
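
A minimal sketch of the trick (my addition; the temperature parameter `tau` is an assumption for illustration): perturb the logits with Gumbel(0, 1) noise and push them through a temperature-scaled softmax. As `tau` approaches 0, the samples approach one-hot draws from the underlying categorical distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def gumbel_softmax_sample(logits, tau=0.5):
    """Continuous relaxation of a categorical sample; tau > 0 controls sharpness."""
    g = rng.gumbel(size=len(logits))              # Gumbel(0, 1) noise
    return softmax((np.asarray(logits) + g) / tau)

logits = np.log([0.2, 0.3, 0.5])
print(gumbel_softmax_sample(logits, tau=1.0))     # soft sample
print(gumbel_softmax_sample(logits, tau=0.05))    # nearly one-hot
```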

6 Entropy

6.1 Softmax

We consider the entropy $H(p)$ of a categorical distribution with probabilities $p = [p_1, p_2, \dots, p_K]^{\mathsf T}$, where the probabilities are given by the softmax function, $p_k = \sigma_k(z) = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} = \frac{e^{z_k}}{Z}$, with $Z = \sum_{j=1}^K e^{z_j}$.

The entropy $H(p)$ is by definition $H(p) = -\sum_{k=1}^K p_k \log p_k$. Substituting $p_k = e^{z_k}/Z$ (so that $\log p_k = z_k - \log Z$) into the entropy expression, we obtain
$$H(p) = -\sum_{k=1}^K p_k \log p_k = -\sum_{k=1}^K p_k z_k + \sum_{k=1}^K p_k \log Z = -\sum_{k=1}^K p_k z_k + \log Z.$$

Thus, the entropy of the softmax distribution simplifies to
$$H(\sigma(z)) = \log Z - \sum_{k=1}^K p_k z_k.$$
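
The identity $H(\sigma(z)) = \log Z - \sum_k p_k z_k$ is easy to confirm numerically; this check is my addition.

```python
import numpy as np

z = np.array([1.5, -0.3, 0.0, 2.2])
e = np.exp(z - z.max())
p = e / e.sum()

H_direct = -np.sum(p * np.log(p))
log_Z = z.max() + np.log(np.sum(np.exp(z - z.max())))   # stable log-sum-exp
H_identity = log_Z - np.sum(p * z)

print(H_direct, H_identity)   # the two expressions agree
```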

If we are using softmax we probably care about derivatives, so let us compute the gradient of the entropy with respect to $z_i$. Writing $\bar z = \sum_k p_k z_k$ for the mean logit under $p$,
$$\frac{\partial H}{\partial z_i} = \frac{\partial}{\partial z_i}\Bigl(\log Z - \sum_{k=1}^K p_k z_k\Bigr) = \frac{1}{Z}\frac{\partial Z}{\partial z_i} - \sum_{k=1}^K\Bigl(\frac{\partial p_k}{\partial z_i} z_k + p_k \delta_{ik}\Bigr) = p_i - \Bigl(\sum_{k=1}^K p_k(\delta_{ik} - p_i) z_k + p_i\Bigr) = p_i - \bigl(p_i z_i - p_i \bar z + p_i\bigr) = -p_i\,(z_i - \bar z),$$
where we used $\frac{\partial Z}{\partial z_i} = e^{z_i} = Z p_i$ and $\frac{\partial p_k}{\partial z_i} = p_k(\delta_{ik} - p_i)$.

Thus, the gradient vector is $\nabla_z H = -p \odot (z - \bar z\mathbf{1})$; note that its components sum to zero, as they must, since adding a constant to every logit leaves the softmax (and hence the entropy) unchanged. Thence the Hessian matrix $\nabla^2 H$: writing $c_i = z_i - \bar z$ and using $\frac{\partial \bar z}{\partial z_j} = p_j(1 + c_j)$,
$$\frac{\partial^2 H}{\partial z_i \partial z_j} = -\frac{\partial}{\partial z_j}\bigl(p_i c_i\bigr) = -p_i(1 + c_i)\,\delta_{ij} + p_i p_j\bigl(1 + c_i + c_j\bigr),$$
i.e. $\nabla^2 H = -\operatorname{diag}\bigl(p \odot (1 + c)\bigr) + p p^{\mathsf T} + (p \odot c)\,p^{\mathsf T} + p\,(p \odot c)^{\mathsf T}$.

For compactness, we define $p = \sigma(z)$ and $c = z - \bar z\mathbf{1}$. Using the Taylor expansion, we approximate the entropy after a small change $\Delta z$:
$$H(z + \Delta z) \approx H(z) + (\nabla_z H)^{\mathsf T}\Delta z + \tfrac{1}{2}\Delta z^{\mathsf T}(\nabla^2 H)\Delta z = H(p) - \sum_{i=1}^K p_i c_i \Delta z_i - \tfrac{1}{2}\sum_{i=1}^K p_i (1 + c_i)\Delta z_i^2 + \tfrac{1}{2}\Bigl(\sum_{i=1}^K p_i \Delta z_i\Bigr)^2 + \Bigl(\sum_{i=1}^K p_i c_i \Delta z_i\Bigr)\Bigl(\sum_{j=1}^K p_j \Delta z_j\Bigr).$$
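
Because this algebra is easy to get wrong, here is a finite-difference sanity check (my addition) of the gradient, the Hessian, and the resulting second-order approximation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def entropy(z):
    p = softmax(z)
    return -np.sum(p * np.log(p))

def grad_entropy(z):
    p = softmax(z)
    c = z - p @ z                        # centred logits c_i = z_i - zbar
    return -p * c

def hess_entropy(z):
    p = softmax(z)
    c = z - p @ z
    return (-np.diag(p * (1 + c)) + np.outer(p, p)
            + np.outer(p * c, p) + np.outer(p, p * c))

z = np.array([0.4, -1.2, 0.9, 0.0])
I = np.eye(len(z))
eps = 1e-5
g_fd = np.array([(entropy(z + eps * I[i]) - entropy(z - eps * I[i])) / (2 * eps)
                 for i in range(len(z))])
print(np.max(np.abs(grad_entropy(z) - g_fd)))    # gradient matches finite differences

dz = 0.01 * np.array([1.0, -2.0, 0.5, 0.3])
taylor = entropy(z) + grad_entropy(z) @ dz + 0.5 * dz @ hess_entropy(z) @ dz
print(entropy(z + dz), taylor)                   # agree to third order in dz
```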

6.2 Non-exponential

Let’s extend the reasoning to category probabilities given by the generalized softmax function,
$$p_k = \Phi_k(z) = \frac{\phi(z_k)}{\sum_{j=1}^K \phi(z_j)} = \frac{\phi(z_k)}{Z},$$
where $\phi: \mathbb{R} \to \mathbb{R}_+$ is an increasing, differentiable function, and $Z = \sum_{j=1}^K \phi(z_j)$.

The entropy becomes
$$H(p) = -\sum_{k=1}^K p_k \log p_k = -\sum_{k=1}^K p_k\bigl(\log \phi(z_k) - \log Z\bigr) = -\sum_{k=1}^K p_k \log \phi(z_k) + \log Z.$$

To compute the gradient $\nabla_z H$, we note that
$$\frac{\partial p_k}{\partial z_i} = p_k\Bigl(s_k \delta_{ik} - \sum_{j=1}^K p_j s_j \delta_{ij}\Bigr) = p_k s_k \delta_{ik} - p_k p_i s_i, \qquad \text{where } s_i = \frac{\phi'(z_i)}{\phi(z_i)}.$$

Then, the gradient is
$$\frac{\partial H}{\partial z_i} = -\sum_{k=1}^K\Bigl(\frac{\partial p_k}{\partial z_i}\log \phi(z_k) + p_k\frac{\phi'(z_k)}{\phi(z_k)}\delta_{ik}\Bigr) + \frac{1}{Z}\phi'(z_i) = -\sum_{k=1}^K\Bigl(\bigl(p_k s_k \delta_{ik} - p_k p_i s_i\bigr)\log \phi(z_k) + p_k s_k \delta_{ik}\Bigr) + \frac{1}{Z}\phi'(z_i).$$
Since $\frac{1}{Z}\phi'(z_i) = p_i s_i = \sum_{k=1}^K p_k s_k \delta_{ik}$, the last two terms cancel, leaving
$$\frac{\partial H}{\partial z_i} = -p_i s_i\Bigl(\log \phi(z_i) - \sum_{k=1}^K p_k \log \phi(z_k)\Bigr),$$
which recovers the plain-softmax result $\frac{\partial H}{\partial z_i} = -p_i(z_i - \bar z)$ when $\phi = \exp$ (so that $s_i = 1$ and $\log \phi(z_i) = z_i$).
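
As before, a quick numerical check (my addition, using softplus as an illustrative $\phi$) of the simplified gradient against central finite differences.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def gen_softmax(z):
    w = softplus(z)
    return w / w.sum()

def entropy(z):
    p = gen_softmax(z)
    return -np.sum(p * np.log(p))

def grad_entropy(z):
    p = gen_softmax(z)
    s = (1.0 / (1.0 + np.exp(-z))) / softplus(z)   # s_i = phi'(z_i) / phi(z_i)
    log_phi = np.log(softplus(z))
    return -p * s * (log_phi - np.sum(p * log_phi))

z = np.array([0.7, -0.4, 1.3, 0.0])
I = np.eye(len(z))
eps = 1e-5
g_fd = np.array([(entropy(z + eps * I[i]) - entropy(z - eps * I[i])) / (2 * eps)
                 for i in range(len(z))])
print(np.max(np.abs(grad_entropy(z) - g_fd)))   # small: analytic gradient checks out
```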
