Distances between Gaussian distributions
Nearly equivalent to distances between symmetric positive definite matrices
June 27, 2016 — May 3, 2023
Since Gaussian approximations pop up a lot in e.g. variational approximation problems, it is nice to know how various probability metrics come out for them.
Since the “difficult” part of the problem is the distance between the covariances, this often ends up being the same, or at least closely related to the question of matrix norms, where the matrices in question are the positive (semi-)definite covariance/precision matrices.
1 Wasserstein
A useful analytic result about Wasserstein-2 distance, i.e. \(W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2}\) for \(X\sim\nu\), \(Y\sim\mu\). Two Gaussians may be related thusly (Givens and Shortt 1984):
\[\begin{aligned} d&:= W_2(\mathcal{N}(\mu_1,\Sigma_1);\mathcal{N}(\mu_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert \mu_1-\mu_2\Vert_2^2 + \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]
In the centered case this is even simpler:
\[\begin{aligned} d&:= W_2(\mathcal{N}(0,\Sigma_1);\mathcal{N}(0,\Sigma_2))\\ \Rightarrow d^2&= \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]
2 Kullback-Leibler
Pulled from wikipedia:
\[ D_{\text{KL}}(\mathcal{N}(\mu_1,\Sigma_1)\parallel \mathcal{N}(\mu_2,\Sigma_2)) ={\frac {1}{2}}\left(\operatorname {tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)+(\mu_{2}-\mu_{1})^{\mathsf {T}}\Sigma _{2}^{-1}(\mu_{2}-\mu_{1})-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]
In the centered case this reduces to
\[ D_{\text{KL}}(\mathcal{N}(0,\Sigma_1)\parallel \mathcal{N}(0, \Sigma_2)) ={\frac {1}{2}}\left(\operatorname{tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]
3 Hellinger
Djalil defines Hellinger distance
\[\mathrm{H}(\mu,\nu) ={\Vert\sqrt{f}-\sqrt{g}\Vert}_{\mathrm{L}^2(\lambda)} =\Bigr(\int(\sqrt{f}-\sqrt{g})^2\mathrm{d}\lambda\Bigr)^{1/2}.\]
via Hellinger affinity
\[\mathrm{A}(\mu,\nu) =\int\sqrt{fg}\mathrm{d}\lambda, \quad \mathrm{H}(\mu,\nu)^2 =2-2A(\mu,\nu).\]
For Gaussians, it apparently turns out that
\[\mathrm{A}(\mathcal{N}(m_1,\sigma_1^2),\mathcal{N}(m_2,\sigma_2^2)) =\sqrt{2\frac{\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}} \exp\Bigr(-\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr),\]
In multiple dimensions:
\[\mathrm{A}(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)) =\frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det(\frac{\Sigma_1+\Sigma_2}{2})^{1/2}} \exp\Bigr(-\frac{\langle\Delta m,(\Sigma_1+\Sigma_2)^{-1}\Delta m\rangle}{4}\Bigr).\]
4 “Natural” / geodesic
High-speed introduction to geometry of distances: Moakher and Batchelor (2006), which discusses the problems of comparing covariance matrices on a natural manifold.
The geodesic distance between \(\mathbf{P}\) and \(\mathbf{Q}\) in \(\mathcal{P}(n)\) is given by Lang (1999) \[ \mathrm{d}_R(\mathbf{P}, \mathbf{Q})=\left\|\log \left(\mathbf{P}^{-1} \mathbf{Q}\right)\right\|_F=\left[\sum_{i=1}^n \log ^2 \lambda_i\left(\mathbf{P}^{-1} \mathbf{Q}\right)\right]^{1 / 2}, \] where \(\lambda_i\left(\mathbf{P}^{-1} \mathbf{Q}\right), 1 \leq i \leq n\) are the eigenvalues of the matrix \(\mathbf{P}^{-1} \mathbf{Q}\). Because \(\mathbf{P}^{-1} \mathbf{Q}\) is similar to \(\mathbf{P}^{-1 / 2} \mathbf{Q} \mathbf{P}^{-1 / 2}\), the eigenvalues \(\lambda_i\left(\mathbf{P}^{-1} \mathbf{Q}\right)\) are all positive and hence [this] is well defined for all \(\mathbf{P}\) and \(\mathbf{Q}\) of \(\mathcal{P}(n)\).
5 Maximum mean discrepancy
With the Gaussian kernel, there is a closed-form expression for the distance of an RV from the standard Gaussian (Rustamov 2021).
The computation of the MMD requires specifying a positive-definite kernel; in this paper we always assume it to be the Gaussian RBF kernel of width \(\gamma\), namely, \(k(x, y)=e^{-\|x-y\|^2 /\left(2 \gamma^2\right)}\). Here, \(x, y \in \mathbb{R}^d\), where \(d\) is the dimension of the code/latent space, and we use \(\|\cdot\|\) to denote the \(\ell_2\) norm. The population MMD can be most straightforwardly computed via the formula (Gretton et al. 2012): \[ \operatorname{MMD}^2(P, Q)=\mathbb{E}_{x, x^{\prime} \sim P}\left[k\left(x, x^{\prime}\right)\right]-2 \mathbb{E}_{x \sim P, y \sim Q}[k(x, y)]+\mathbb{E}_{y, y^{\prime} \sim Q}\left[k\left(y, y^{\prime}\right)\right] \]
[U]sing the sample \(Q_n=\left\{z_i\right\}_{i=1}^n\) of size \(n\), we replace the last two terms by the sample average and the U-statistic respectively to obtain the unbiased estimator: \[ \operatorname{MMD}_u^2\left(\mathcal{N}_d, Q_n\right)=\mathbb{E}_{x, x^{\prime} \sim \mathcal{N}_d}\left[k\left(x, x^{\prime}\right)\right]-\frac{2}{n} \sum_{i=1}^n \mathbb{E}_{x \sim \mathcal{N}_d}\left[k\left(x, z_i\right)\right]+\frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i}^n k\left(z_i, z_j\right) \] Our main result is the following proposition whose proof can be found in Appendix C: Proposition. The expectations in the expression above can be computed analytically to yield the formula: \[ \operatorname{MMD}_u^2\left(\mathcal{N}_d, Q_n\right)=\left(\frac{\gamma^2}{2+\gamma^2}\right)^{d / 2}-\frac{2}{n}\left(\frac{\gamma^2}{1+\gamma^2}\right)^{d / 2} \sum_{i=1}^n e^{-\frac{\left\|z_i\right\|^2}{2\left(1+\gamma^2\right)}}+\frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i}^n e^{-\frac{\left\|z_i-z_j\right\|^2}{2 \gamma^2}} \]
Extending this to arbitrary Gaussians, or analytic Gaussians, is left as an exercise.
Intuitively, we might prefer other kernels than the RBF if we are comparing Gaussians specifically; in particular, the RBF is “wasteful” in that it controls moments of all orders, whereas we might only care about the first two moments (mean and covariance), for Gaussians.