# Distances between Gaussian distributions

Nearly equivalent to distances between symmetric positive definite matrices

June 27, 2016 — May 3, 2023

algebra
Gaussian
geometry
high d
linear algebra
measure
probability
signal processing
spheres
statistics

Since Gaussian approximations pop up a lot in e.g. variational approximation problems, it is nice to know how various probability metrics come out for them.

Since the “difficult” part of the problem is the distance between the covariances, this often ends up being the same, or at least closely related to the question of matrix norms, where the matrices in question are the positive (semi-)definite covariance/precision matrices.

## 1 Wasserstein

A useful analytic result about Wasserstein-2 distance, i.e. $$W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2}$$ for $$X\sim\nu$$, $$Y\sim\mu$$. Two Gaussians may be related thusly : \begin{aligned} d&:= W_2(\mathcal{N}(\mu_1,\Sigma_1);\mathcal{N}(\mu_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert \mu_1-\mu_2\Vert_2^2 + \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}

In the centred case this the even simpler \begin{aligned} d&:= W_2(\mathcal{N}(0,\Sigma_1);\mathcal{N}(0,\Sigma_2))\\ \Rightarrow d^2&= \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}

## 2 Kullback-Leibler

Pulled from wikipedia:

$D_{\text{KL}}(\mathcal{N}(\mu_1,\Sigma_1)\parallel \mathcal{N}(\mu_2,\Sigma_2)) ={\frac {1}{2}}\left(\operatorname {tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)+(\mu_{2}-\mu_{1})^{\mathsf {T}}\Sigma _{2}^{-1}(\mu_{2}-\mu_{1})-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).$

In the centred case this reduces to

$D_{\text{KL}}(\mathcal{N}(0,\Sigma_1)\parallel \mathcal{N}(0, \Sigma_2)) ={\frac {1}{2}}\left(\operatorname{tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).$

## 3 Hellinger

Djalil defines Hellinger distance $\mathrm{H}(\mu,\nu) ={\Vert\sqrt{f}-\sqrt{g}\Vert}_{\mathrm{L}^2(\lambda)} =\Bigr(\int(\sqrt{f}-\sqrt{g})^2\mathrm{d}\lambda\Bigr)^{1/2}.$ via Hellinger affinity $\mathrm{A}(\mu,\nu) =\int\sqrt{fg}\mathrm{d}\lambda, \quad \mathrm{H}(\mu,\nu)^2 =2-2A(\mu,\nu).$ For Gaussians it apparently turns out that $\mathrm{A}(\mathcal{N}(m_1,\sigma_1^2),\mathcal{N}(m_2,\sigma_2^2)) =\sqrt{2\frac{\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}} \exp\Bigr(-\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr),$

In multiple dimensions: $\mathrm{A}(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)) =\frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det(\frac{\Sigma_1+\Sigma_2}{2})^{1/2}} \exp\Bigr(-\frac{\langle\Delta m,(\Sigma_1+\Sigma_2)^{-1}\Delta m)\rangle}{4}\Bigr).$

## 4 “Natural” / geodesic

High speed introduction to geometry of distances: Moakher and Batchelor (2006), which discusses the problems of comparing covariance matrices on a natural manifold.

The geodesic distance between $$\mathbf{P}$$ and $$\mathbf{Q}$$ in $$\mathcal{P}(n)$$ is given by Lang (1999) $\mathrm{d}_R(\mathbf{P}, \mathbf{Q})=\left\|\log \left(\mathbf{P}^{-1} \mathbf{Q}\right)\right\|_F=\left[\sum_{i=1}^n \log ^2 \lambda_i\left(\mathbf{P}^{-1} \mathbf{Q}\right)\right]^{1 / 2},$ where $$\lambda_i\left(\mathbf{P}^{-1} \mathbf{Q}\right), 1 \leq i \leq n$$ are the eigenvalues of the matrix $$\mathbf{P}^{-1} \mathbf{Q}$$. Because $$\mathbf{P}^{-1} \mathbf{Q}$$ is similar to $$\mathbf{P}^{-1 / 2} \mathbf{Q} \mathbf{P}^{-1 / 2}$$, the eigenvalues $$\lambda_i\left(\mathbf{P}^{-1} \mathbf{Q}\right)$$ are all positive and hence [this] is well defined for all $$\mathbf{P}$$ and $$\mathbf{Q}$$ of $$\mathcal{P}(n)$$.

## 5 Maximum mean discrepancy

With the Gaussian kernel, there is a closed-form expression for the distance of an RV from the standard Gaussian .

The computation of the MMD requires specifying a positive-definite kernel; in this paper we always assume it to be the Gaussian RBF kernel of width $$\gamma$$, namely, $$k(x, y)=e^{-\|x-y\|^2 /\left(2 \gamma^2\right)}$$. Here, $$x, y \in \mathbb{R}^d$$, where $$d$$ is the dimension of the code/latent space, and we use $$\|\cdot\|$$ to denote the $$\ell_2$$ norm. The population MMD can be most straight-forwardly computed via the formula : $\operatorname{MMD}^2(P, Q)=\mathbb{E}_{x, x^{\prime} \sim P}\left[k\left(x, x^{\prime}\right)\right]-2 \mathbb{E}_{x \sim P, y \sim Q}[k(x, y)]+\mathbb{E}_{y, y^{\prime} \sim Q}\left[k\left(y, y^{\prime}\right)\right]$

[U]sing the sample $$Q_n=\left\{z_i\right\}_{i=1}^n$$ of size $$n$$, we replace the last two terms by the sample average and the U-statistic respectively to obtain the unbiased estimator: $\operatorname{MMD}_u^2\left(\mathcal{N}_d, Q_n\right)=\mathbb{E}_{x, x^{\prime} \sim \mathcal{N}_d}\left[k\left(x, x^{\prime}\right)\right]-\frac{2}{n} \sum_{i=1}^n \mathbb{E}_{x \sim \mathcal{N}_d}\left[k\left(x, z_i\right)\right]+\frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i}^n k\left(z_i, z_j\right)$ Our main result is the following proposition whose proof can be found in Appendix C: Proposition. The expectations in the expression above can be computed analytically to yield the formula: $\operatorname{MMD}_u^2\left(\mathcal{N}_d, Q_n\right)=\left(\frac{\gamma^2}{2+\gamma^2}\right)^{d / 2}-\frac{2}{n}\left(\frac{\gamma^2}{1+\gamma^2}\right)^{d / 2} \sum_{i=1}^n e^{-\frac{\left\|z_i\right\|^2}{2\left(1+\gamma^2\right)}}+\frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i}^n e^{-\frac{\left\|z_i-z_j\right\|^2}{2 \gamma^2}}$

Extending this to arbitrary Gaussians, or analytic Gaussians, is left as an exercise.

Intuitively, we might prefer other kernels than the RBF if we are comparing Gaussians specifically; in particular the RBF is “wasteful” in that it controls moments of all orders, whereas we might only care about the first two moments (mean and covariance), for Gaussians.

## 6 References

Givens, and Shortt. 1984. The Michigan Mathematical Journal.
Gretton, Borgwardt, Rasch, et al. 2012. The Journal of Machine Learning Research.
Lang. 1999. Fundamentals of Differential Geometry. Graduate Texts in Mathematics.
Magnus, and Neudecker. 2019. Matrix differential calculus with applications in statistics and econometrics. Wiley series in probability and statistics.
Meckes. 2009. In High Dimensional Probability V: The Luminy Volume.
Minka. 2000. Old and new matrix algebra useful for statistics.
Moakher, and Batchelor. 2006. In Visualization and Processing of Tensor Fields.
Omladič, and Šemrl. 1990. “On the Distance Between Normal Matrices.” Proceedings of the American Mathematical Society.
Petersen, and Pedersen. 2012.
Takatsu. 2008.
Zhang, Liu, Chen, et al. 2022.