Stunts with Gaussian distributions.
Let’s start with the basics. The (univariate) standard Gaussian pdf is
\[ \phi:x\mapsto \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right) \]
We define the corresponding cdf
\[ \Phi:x\mapsto \int_{-\infty}^x\phi(t)\, dt \]
More generally we define
\[ \phi(x; \mu ,\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}} \]
In the multivariate case, where the covariance \(\Sigma\) is strictly positive definite, the density of the general normal distribution over \(\mathbb{R}^k\) is
\[ \phi({x}; \mu, \Sigma) = (2\pi )^{-{\frac {k}{2}}}\det({ {\Sigma }})^{-{\frac {1}{2}}}\,e^{-{\frac {1}{2}}( {x} -{ {\mu }})^{\!{\top}}{ {\Sigma }}^{-1}( {x} -{ {\mu }})} \]
If a random variable \(Y\) has a Gaussian distribution with parameters \(\mu, \Sigma\), we write
\[Y \sim \mathcal{N}(\mu, \Sigma)\]
What is Erf again?
This erf, or error function, is a rebranding and reparameterisation, popular in computer science, of the standard univariate normal cdf; it gives a slightly differently ambiguous name to an already ambiguously named function.
But I can never remember what it is. There are scaling factors tacked on.
\[ \operatorname{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^x e^{-t^2} \, dt \]
which is to say
\[\begin{aligned} \Phi(x) &={\frac {1}{2}}\left[1+\operatorname {erf} \left({\frac {x}{\sqrt {2}}}\right)\right]\\ \operatorname {erf}(x) &=2\Phi (\sqrt{2}x)-1\\ \end{aligned}\]
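Since the scaling factors are exactly what I forget, here is a quick scipy sanity check of those identities:

```python
# Numerical check of the erf <-> Phi identities.
import numpy as np
from scipy.special import erf
from scipy.stats import norm

x = np.linspace(-3, 3, 13)
# Phi(x) = (1 + erf(x / sqrt(2))) / 2
assert np.allclose(norm.cdf(x), (1 + erf(x / np.sqrt(2))) / 2)
# erf(x) = 2 * Phi(sqrt(2) * x) - 1
assert np.allclose(erf(x), 2 * norm.cdf(np.sqrt(2) * x) - 1)
```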
Done.
Representations of density and CDF
Left tail of icdf
For small \(p\), the quantile function has the useful asymptotic expansion
\[ \Phi^{-1}(p) = -\sqrt{\ln\frac{1}{p^2} - \ln\ln\frac{1}{p^2} - \ln(2\pi)} + o(1). \]
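A quick look at how good this is, against scipy’s exact quantile:

```python
# Compare the left-tail expansion of the inverse cdf to scipy's norm.ppf.
import numpy as np
from scipy.stats import norm

for p in [1e-4, 1e-8, 1e-16]:
    approx = -np.sqrt(np.log(1 / p**2) - np.log(np.log(1 / p**2)) - np.log(2 * np.pi))
    print(f"p={p:g}: exact {norm.ppf(p):.4f}, asymptotic {approx:.4f}")
```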
ODE representation for the univariate density
\[ \sigma ^2 \phi'(x)+\phi(x) (x-\mu )=0, \]
i.e. \(\phi\) is annihilated by the operator \(L:=\sigma^2 D+(x-\mu)\), where \(D\) denotes differentiation.
This is first order, so a single initial condition pins it down:
\[ \phi(0) =\frac{e^{-\mu ^2/(2\sigma ^2)}}{\sqrt{2 \pi \sigma^2 } } \]
(The condition \(\phi'(\mu)=0\) then follows from the ODE itself.)
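A minimal numerical check, integrating the ODE with scipy from \(x=0\) (the particular \(\mu,\sigma\) are arbitrary):

```python
# Integrate sigma^2 phi' + (x - mu) phi = 0 and compare to the exact density.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import norm

mu, sigma = 0.5, 1.3
phi0 = np.exp(-mu**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
sol = solve_ivp(
    lambda x, phi: -(x - mu) * phi / sigma**2,
    t_span=(0.0, 3.0),
    y0=[phi0],
    dense_output=True,
    rtol=1e-8,
)
xs = np.linspace(0.0, 3.0, 7)
assert np.allclose(sol.sol(xs)[0], norm.pdf(xs, loc=mu, scale=sigma), atol=1e-6)
```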
🏗 note where I learned this.
ODE representation for the univariate icdf
From (Steinbrecher and Shaw 2008) via Wikipedia.
Let us write \(w:=\Phi^{-1}\) to keep the notation clear.
\[\begin{aligned} {\frac {d^{2}w}{dp^{2}}} &=w\left({\frac {dw}{dp}}\right)^{2}\\ \end{aligned}\]
With initial conditions
\[\begin{aligned} w\left(1/2\right)&=0,\\ w'\left(1/2\right)&={\sqrt {2\pi }}. \end{aligned}\]
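This integrates nicely; a sketch with scipy, starting at \(p=1/2\) and comparing against the exact quantile:

```python
# Integrate the quantile ODE w'' = w (w')^2 and compare to norm.ppf.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import norm

def rhs(p, y):
    w, dw = y  # y = [w, w']
    return [dw, w * dw**2]

sol = solve_ivp(rhs, t_span=(0.5, 0.99), y0=[0.0, np.sqrt(2 * np.pi)],
                dense_output=True, rtol=1e-10, atol=1e-12)
ps = np.linspace(0.5, 0.99, 5)
assert np.allclose(sol.sol(ps)[0], norm.ppf(ps), atol=1e-5)
```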
Density PDE representation as a diffusion equation
(Botev, Grotowski, and Kroese 2010)
\[\begin{aligned} \frac{\partial}{\partial t}\phi(x;t) &=\frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(x;t)\\ \phi(x;0)&=\delta(x-\mu) \end{aligned}\]
Look, it’s the diffusion equation of the Wiener process. Surprise.
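A throwaway finite-difference check that \(\phi(x;t)\), i.e. the \(\mathcal{N}(\mu,t)\) density, satisfies this PDE at an arbitrary point:

```python
# Check d/dt phi = (1/2) d^2/dx^2 phi for the N(mu, t) density by finite differences.
import numpy as np
from scipy.stats import norm

mu, t, x, h = 0.0, 1.0, 0.7, 1e-4
phi = lambda x, t: norm.pdf(x, loc=mu, scale=np.sqrt(t))
dt = (phi(x, t + h) - phi(x, t - h)) / (2 * h)
dxx = (phi(x + h, t) - 2 * phi(x, t) + phi(x - h, t)) / h**2
assert np.isclose(dt, 0.5 * dxx, rtol=1e-4)
```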
Rational approximations
🏗
Roughness
Univariate: the squared \(L^2\) norms of the derivatives of \(\phi_\sigma:=\phi(\,\cdot\,;\mu,\sigma^2)\) are
\[\begin{aligned} \left\| \frac{d}{dx}\phi_\sigma \right\|_2^2 &= \frac{1}{4\sqrt{\pi}\sigma^3}\\ \left\| \left(\frac{d}{dx}\right)^n \phi_\sigma \right\|_2^2 &= \frac{(2n-1)!!}{2^{n+1}\sqrt{\pi}\sigma^{2n+1}} = \frac{\prod_{i=1}^{n}(2i-1)}{2^{n+1}\sqrt{\pi}\sigma^{2n+1}} \end{aligned}\]
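These can be checked by quadrature, using the Hermite-polynomial form of the Gaussian derivatives, \(\phi^{(n)}(x)=(-1)^n \mathrm{He}_n(x)\phi(x)\):

```python
# Verify ||phi_sigma^(n)||_2^2 = (2n-1)!! / (2^(n+1) sqrt(pi) sigma^(2n+1)).
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_hermitenorm, factorial2
from scipy.stats import norm

sigma = 0.8

def dphi(x, n):
    # phi_sigma^(n)(x) = (-1)^n He_n(x / sigma) phi(x / sigma) / sigma^(n + 1)
    return (-1)**n * eval_hermitenorm(n, x / sigma) * norm.pdf(x / sigma) / sigma**(n + 1)

for n in [1, 2, 3]:
    numeric, _ = quad(lambda x: dphi(x, n)**2, -np.inf, np.inf)
    exact = factorial2(2 * n - 1) / (2**(n + 1) * np.sqrt(np.pi) * sigma**(2 * n + 1))
    assert np.isclose(numeric, exact)
```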
Multidimensional marginals and conditionals
As made famous by Wiener processes in finance and Gaussian processes in Bayesian nonparametrics.
See, e.g., these lectures, or Michael I. Jordan’s backgrounders.
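For the record, the workhorse fact: if \(\begin{bmatrix}x_1 \\ x_2\end{bmatrix}\sim\mathcal{N}\left(\begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix},\begin{bmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{bmatrix}\right)\), then \(x_1\mid x_2=a \sim \mathcal{N}\bigl(\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(a-\mu_2),\; \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\bigr)\). A minimal numpy sketch (the helper name and index convention are mine):

```python
# Gaussian conditioning: distribution of x[keep] given the other coordinates equal a.
import numpy as np

def condition_gaussian(mu, Sigma, keep, a):
    keep = np.asarray(keep)
    obs = np.setdiff1d(np.arange(len(mu)), keep)
    S11 = Sigma[np.ix_(keep, keep)]
    S12 = Sigma[np.ix_(keep, obs)]
    S22 = Sigma[np.ix_(obs, obs)]
    mu_cond = mu[keep] + S12 @ np.linalg.solve(S22, a - mu[obs])
    Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return mu_cond, Sigma_cond

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
m, S = condition_gaussian(mu, Sigma, keep=[0], a=np.array([1.0]))
# m = [0.8], S = [[0.36]]: shrinkage toward the observed coordinate.
```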
Transformed variables
An affine image of a Gaussian vector is again Gaussian: if \(Y \sim \mathcal{N}(\mu, \Sigma)\) then \(AY+b \sim \mathcal{N}(A\mu+b,\, A\Sigma A^{\top})\). A special case:
\[ Y \sim \mathcal{N}(X\beta, I) \]
implies
\[ W^{1/2}Y \sim \mathcal{N}(W^{1/2}X\beta, W) \]
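Here \(W^{1/2}\) is the symmetric square root of a positive definite weight matrix \(W\), so the covariance becomes \(W^{1/2}I(W^{1/2})^{\top}=W\). A Monte Carlo sanity check (the particular \(W\) and mean vector are made up; the mean stands in for \(X\beta\)):

```python
# Monte Carlo check: if Y ~ N(m, I) then W^(1/2) Y ~ N(W^(1/2) m, W).
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(42)
W = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
A = np.real(sqrtm(W))               # symmetric square root W^{1/2}
m = np.array([1.0, -1.0, 0.5])      # stands in for X @ beta
Y = rng.standard_normal((200_000, 3)) + m
Z = Y @ A.T                         # apply W^{1/2} row-wise
assert np.allclose(Z.mean(axis=0), A @ m, atol=0.02)
assert np.allclose(np.cov(Z.T), W, atol=0.02)
```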
Metrics
Since Gaussian approximations pop up a lot in, e.g., variational approximation problems, it is nice to know how to measure the distance between Gaussians in various probability metrics.
Wasserstein
Useful: two Gaussians are related thusly in the Wasserstein-2 distance, \(W_2(\mu;\nu):=\inf\left(\mathbb{E}\Vert X-Y\Vert_2^2\right)^{1/2}\), the infimum being over all couplings with \(X\sim\mu\), \(Y\sim\nu\).
\[\begin{aligned} d&:= W_2(\mathcal{N}(\mu_1,\Sigma_1);\mathcal{N}(\mu_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert \mu_1-\mu_2\Vert_2^2 + \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]
In the centred case this is simply
\[\begin{aligned} d&:= W_2(\mathcal{N}(0,\Sigma_1);\mathcal{N}(0,\Sigma_2))\\ \Rightarrow d^2&= \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]
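A direct transcription (the helper name is mine):

```python
# Closed-form W2 distance between two Gaussians.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, Sigma1, mu2, Sigma2):
    root1 = np.real(sqrtm(Sigma1))
    cross = np.real(sqrtm(root1 @ Sigma2 @ root1))
    d2 = np.sum((mu1 - mu2)**2) + np.trace(Sigma1 + Sigma2 - 2 * cross)
    return np.sqrt(max(d2, 0.0))  # clip tiny negative values from roundoff
```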
Kullback-Leibler
Pulled from Wikipedia, with \(k\) the dimension:
\[ D_{\text{KL}}(\mathcal{N}(\mu_1,\Sigma_1)\parallel \mathcal{N}(\mu_2,\Sigma_2)) ={\frac {1}{2}}\left(\operatorname {tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)+(\mu_{2}-\mu_{1})^{\mathsf {T}}\Sigma _{2}^{-1}(\mu_{2}-\mu_{1})-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]
In the centred case this reduces to
\[ D_{\text{KL}}(\mathcal{N}(0,\Sigma_1)\parallel \mathcal{N}(0, \Sigma_2)) ={\frac {1}{2}}\left(\operatorname{tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]
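Likewise in code (again, the helper name is mine):

```python
# Closed-form KL divergence between two multivariate Gaussians.
import numpy as np

def gaussian_kl(mu1, Sigma1, mu2, Sigma2):
    k = len(mu1)
    dmu = mu2 - mu1
    trace_term = np.trace(np.linalg.solve(Sigma2, Sigma1))  # tr(Sigma2^{-1} Sigma1)
    quad_term = dmu @ np.linalg.solve(Sigma2, dmu)
    logdet_term = np.linalg.slogdet(Sigma2)[1] - np.linalg.slogdet(Sigma1)[1]
    return 0.5 * (trace_term + quad_term - k + logdet_term)
```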
Hellinger
Djalil Chafaï defines both the Hellinger distance, for measures \(\mu,\nu\) with respective densities \(f,g\) with respect to a common dominating measure \(\lambda\),
\[\mathrm{H}(\mu,\nu) ={\Vert\sqrt{f}-\sqrt{g}\Vert}_{\mathrm{L}^2(\lambda)} =\Bigl(\int(\sqrt{f}-\sqrt{g})^2\,\mathrm{d}\lambda\Bigr)^{1/2},\]
and Hellinger affinity
\[\mathrm{A}(\mu,\nu) =\int\sqrt{fg}\,\mathrm{d}\lambda, \quad \mathrm{H}(\mu,\nu)^2 =2-2\mathrm{A}(\mu,\nu).\]
For Gaussians we can find this exactly:
\[\mathrm{A}(\mathcal{N}(m_1,\sigma_1^2),\mathcal{N}(m_2,\sigma_2^2)) =\sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}} \exp\Bigl(-\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr).\]
In multiple dimensions:
\[\mathrm{A}(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)) =\frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det\bigl(\frac{\Sigma_1+\Sigma_2}{2}\bigr)^{1/2}} \exp\Bigl(-\frac{\langle\Delta m,(\Sigma_1+\Sigma_2)^{-1}\Delta m\rangle}{4}\Bigr), \quad \Delta m := m_1-m_2.\]
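A transcription of the multivariate affinity, computed in log space for numerical safety (helper name mine):

```python
# Hellinger affinity and distance between two multivariate Gaussians.
import numpy as np

def gaussian_hellinger(m1, Sigma1, m2, Sigma2):
    avg = (Sigma1 + Sigma2) / 2
    dm = m1 - m2
    log_affinity = (
        0.25 * (np.linalg.slogdet(Sigma1)[1] + np.linalg.slogdet(Sigma2)[1])
        - 0.5 * np.linalg.slogdet(avg)[1]
        - 0.25 * dm @ np.linalg.solve(Sigma1 + Sigma2, dm)
    )
    A = np.exp(log_affinity)
    return A, np.sqrt(2 - 2 * A)  # (affinity, Hellinger distance)
```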