Distances between Gaussian distributions

Nearly equivalent to distances between symmetric positive definite matrices

June 27, 2016 — May 3, 2023

high d
linear algebra
signal processing
Figure 1

Since Gaussian approximations pop up a lot in e.g. variational approximation problems, it is nice to know how various probability metrics come out for them.

Since the “difficult” part of the problem is the distance between the covariances, this often ends up being the same, or at least closely related to the question of matrix norms, where the matrices in question are the positive (semi-)definite covariance/precision matrices.

1 Wasserstein

A useful analytic result about Wasserstein-2 distance, i.e. \(W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2}\) for \(X\sim\nu\), \(Y\sim\mu\). Two Gaussians may be related thusly (Givens and Shortt 1984): \[\begin{aligned} d&:= W_2(\mathcal{N}(\mu_1,\Sigma_1);\mathcal{N}(\mu_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert \mu_1-\mu_2\Vert_2^2 + \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]

In the centred case this the even simpler \[\begin{aligned} d&:= W_2(\mathcal{N}(0,\Sigma_1);\mathcal{N}(0,\Sigma_2))\\ \Rightarrow d^2&= \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]

2 Kullback-Leibler

Pulled from wikipedia:

\[ D_{\text{KL}}(\mathcal{N}(\mu_1,\Sigma_1)\parallel \mathcal{N}(\mu_2,\Sigma_2)) ={\frac {1}{2}}\left(\operatorname {tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)+(\mu_{2}-\mu_{1})^{\mathsf {T}}\Sigma _{2}^{-1}(\mu_{2}-\mu_{1})-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]

In the centred case this reduces to

\[ D_{\text{KL}}(\mathcal{N}(0,\Sigma_1)\parallel \mathcal{N}(0, \Sigma_2)) ={\frac {1}{2}}\left(\operatorname{tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]

3 Hellinger

Djalil defines Hellinger distance \[\mathrm{H}(\mu,\nu) ={\Vert\sqrt{f}-\sqrt{g}\Vert}_{\mathrm{L}^2(\lambda)} =\Bigr(\int(\sqrt{f}-\sqrt{g})^2\mathrm{d}\lambda\Bigr)^{1/2}.\] via Hellinger affinity \[\mathrm{A}(\mu,\nu) =\int\sqrt{fg}\mathrm{d}\lambda, \quad \mathrm{H}(\mu,\nu)^2 =2-2A(\mu,\nu).\] For Gaussians it apparently turns out that \[\mathrm{A}(\mathcal{N}(m_1,\sigma_1^2),\mathcal{N}(m_2,\sigma_2^2)) =\sqrt{2\frac{\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}} \exp\Bigr(-\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr),\]

In multiple dimensions: \[\mathrm{A}(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)) =\frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det(\frac{\Sigma_1+\Sigma_2}{2})^{1/2}} \exp\Bigr(-\frac{\langle\Delta m,(\Sigma_1+\Sigma_2)^{-1}\Delta m)\rangle}{4}\Bigr).\]

4 “Natural” / geodesic

High speed introduction to geometry of distances: Moakher and Batchelor (2006), which discusses the problems of comparing covariance matrices on a natural manifold.

The geodesic distance between \(\mathbf{P}\) and \(\mathbf{Q}\) in \(\mathcal{P}(n)\) is given by Lang (1999) \[ \mathrm{d}_R(\mathbf{P}, \mathbf{Q})=\left\|\log \left(\mathbf{P}^{-1} \mathbf{Q}\right)\right\|_F=\left[\sum_{i=1}^n \log ^2 \lambda_i\left(\mathbf{P}^{-1} \mathbf{Q}\right)\right]^{1 / 2}, \] where \(\lambda_i\left(\mathbf{P}^{-1} \mathbf{Q}\right), 1 \leq i \leq n\) are the eigenvalues of the matrix \(\mathbf{P}^{-1} \mathbf{Q}\). Because \(\mathbf{P}^{-1} \mathbf{Q}\) is similar to \(\mathbf{P}^{-1 / 2} \mathbf{Q} \mathbf{P}^{-1 / 2}\), the eigenvalues \(\lambda_i\left(\mathbf{P}^{-1} \mathbf{Q}\right)\) are all positive and hence [this] is well defined for all \(\mathbf{P}\) and \(\mathbf{Q}\) of \(\mathcal{P}(n)\).

5 Maximum mean discrepancy

With the Gaussian kernel, there is a closed-form expression for the distance of an RV from the standard Gaussian (Rustamov 2021).

The computation of the MMD requires specifying a positive-definite kernel; in this paper we always assume it to be the Gaussian RBF kernel of width \(\gamma\), namely, \(k(x, y)=e^{-\|x-y\|^2 /\left(2 \gamma^2\right)}\). Here, \(x, y \in \mathbb{R}^d\), where \(d\) is the dimension of the code/latent space, and we use \(\|\cdot\|\) to denote the \(\ell_2\) norm. The population MMD can be most straight-forwardly computed via the formula (Gretton et al. 2012): \[ \operatorname{MMD}^2(P, Q)=\mathbb{E}_{x, x^{\prime} \sim P}\left[k\left(x, x^{\prime}\right)\right]-2 \mathbb{E}_{x \sim P, y \sim Q}[k(x, y)]+\mathbb{E}_{y, y^{\prime} \sim Q}\left[k\left(y, y^{\prime}\right)\right] \]

[U]sing the sample \(Q_n=\left\{z_i\right\}_{i=1}^n\) of size \(n\), we replace the last two terms by the sample average and the U-statistic respectively to obtain the unbiased estimator: \[ \operatorname{MMD}_u^2\left(\mathcal{N}_d, Q_n\right)=\mathbb{E}_{x, x^{\prime} \sim \mathcal{N}_d}\left[k\left(x, x^{\prime}\right)\right]-\frac{2}{n} \sum_{i=1}^n \mathbb{E}_{x \sim \mathcal{N}_d}\left[k\left(x, z_i\right)\right]+\frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i}^n k\left(z_i, z_j\right) \] Our main result is the following proposition whose proof can be found in Appendix C: Proposition. The expectations in the expression above can be computed analytically to yield the formula: \[ \operatorname{MMD}_u^2\left(\mathcal{N}_d, Q_n\right)=\left(\frac{\gamma^2}{2+\gamma^2}\right)^{d / 2}-\frac{2}{n}\left(\frac{\gamma^2}{1+\gamma^2}\right)^{d / 2} \sum_{i=1}^n e^{-\frac{\left\|z_i\right\|^2}{2\left(1+\gamma^2\right)}}+\frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i}^n e^{-\frac{\left\|z_i-z_j\right\|^2}{2 \gamma^2}} \]

Extending this to arbitrary Gaussians, or analytic Gaussians, is left as an exercise.

Intuitively, we might prefer other kernels than the RBF if we are comparing Gaussians specifically; in particular the RBF is “wasteful” in that it controls moments of all orders, whereas we might only care about the first two moments (mean and covariance), for Gaussians.

6 References

Givens, and Shortt. 1984. A Class of Wasserstein Metrics for Probability Distributions. The Michigan Mathematical Journal.
Gretton, Borgwardt, Rasch, et al. 2012. A Kernel Two-Sample Test.” The Journal of Machine Learning Research.
Lang. 1999. Fundamentals of Differential Geometry. Graduate Texts in Mathematics.
Magnus, and Neudecker. 2019. Matrix differential calculus with applications in statistics and econometrics. Wiley series in probability and statistics.
Meckes. 2009. On Stein’s Method for Multivariate Normal Approximation.” In High Dimensional Probability V: The Luminy Volume.
Minka. 2000. Old and new matrix algebra useful for statistics.
Moakher, and Batchelor. 2006. Symmetric Positive-Definite Matrices: From Geometry to Applications and Visualization.” In Visualization and Processing of Tensor Fields.
Omladič, and Šemrl. 1990. “On the Distance Between Normal Matrices.” Proceedings of the American Mathematical Society.
Petersen, and Pedersen. 2012. The Matrix Cookbook.”
Rustamov. 2021. Closed-Form Expressions for Maximum Mean Discrepancy with Applications to Wasserstein Auto-Encoders.” Stat.
Takatsu. 2008. On Wasserstein Geometry of the Space of Gaussian Measures.”
Zhang, Liu, Chen, et al. 2022. On the Properties of Kullback-Leibler Divergence Between Multivariate Gaussian Distributions.”