Many facts about the useful, boring, ubiquitous Gaussian. Djalil Chafaï lists Three reasons for Gaussians, emphasising more abstract, not-necessarily generative reasons.

- Gaussians as isotropic distributions: the Gaussian is the only distribution that can both have independent coordinates and be isotropic (rotation-invariant).
- Entropy maximizing (the Gaussian has the highest entropy of any distribution with fixed variance and finite entropy)
- The only stable distribution with finite variance

Many other things give rise to Gaussians:
sampling distributions for test statistics, bootstrap samples, low dimensional projections, anything with the right Stein-type symmetries…
There are many *post hoc* rationalisations that use the Gaussian in the hope that it is close enough to the real distribution: such as when we assume something is a Gaussian process because they are tractable, or seek a noise distribution that will justify quadratic loss, when we use Brownian motions in stochastic calculus because it comes out neatly, and so on.

## Density, CDF

The standard (univariate) Gaussian pdf is \[ \psi:x\mapsto \frac{1}{\sqrt{2\pi}}\text{exp}\left(-\frac{x^2}{2}\right). \] Typically we allow a scale-location parameterised version \[ \phi(x; \mu,\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}} \] We call the CDF \[ \Psi:x\mapsto \int_{-\infty}^x\psi(t) dt. \] In the multivariate case, where the covariance \(\Sigma\) is strictly positive definite, we can write a density of the general normal distribution over \(\mathbb{R}^k\) as \[ \psi({x}; \mu, \Sigma) = (2\pi )^{-{\frac {k}{2}}}\det(\Sigma)^{-\frac{1}{2}}\,\exp ({-\frac{1}{2}( x-\mu)^{\top}\Sigma^{-1}( x-\mu)}) \] If a random variable \(Y\) has a Gaussian distribution with parameters \(\mu, \Sigma\), we write \[Y \sim \mathcal{N}(\mu, \Sigma)\]
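To make the definitions concrete, here is a minimal sketch that evaluates the scale-location density from the formula and the CDF by crude quadrature, then compares both with Python's `statistics.NormalDist`. The function names and the test point are my own choices, and the trapezoidal rule is for illustration only.

```python
import math
from statistics import NormalDist


def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Univariate Gaussian density phi(x; mu, sigma^2), straight from the formula."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def gaussian_cdf(x, mu=0.0, sigma2=1.0, n=20_000, lower_sds=12.0):
    """CDF by trapezoidal integration of the density (illustrative, not efficient)."""
    a = mu - lower_sds * math.sqrt(sigma2)  # stand-in for minus infinity
    if x <= a:
        return 0.0
    h = (x - a) / n
    total = 0.5 * (gaussian_pdf(a, mu, sigma2) + gaussian_pdf(x, mu, sigma2))
    for i in range(1, n):
        total += gaussian_pdf(a + i * h, mu, sigma2)
    return total * h


# NormalDist is parameterised by the standard deviation, not the variance.
ref = NormalDist(mu=1.0, sigma=2.0)
pdf_gap = abs(gaussian_pdf(0.3, 1.0, 4.0) - ref.pdf(0.3))
cdf_gap = abs(gaussian_cdf(0.3, 1.0, 4.0) - ref.cdf(0.3))
print(pdf_gap, cdf_gap)
```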

Taylor expansion of \(e^{-x^2/2}\): \[ e^{-x^2/2} = \sum_{k=0}^{\infty} \frac{2^{-k} (-x^2)^k}{k!}. \]
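A quick numerical check of the series, as a sketch: the truncation at 40 terms and the evaluation point are arbitrary choices of mine.

```python
import math


def exp_neg_half_sq_series(x, terms=40):
    """Partial sum of sum_k 2^{-k} (-x^2)^k / k!."""
    total = 0.0
    for k in range(terms):
        total += (-(x * x) / 2.0) ** k / math.factorial(k)
    return total


x = 1.5
approx = exp_neg_half_sq_series(x)
exact = math.exp(-x * x / 2.0)
print(approx, exact)
```

The series alternates in sign, so for large \(|x|\) the partial sums suffer cancellation in floating point; it is a poor way to evaluate the density far from the origin.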

### Score

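\[\begin{aligned} \nabla_{x}\log\psi({x}; \mu, \Sigma) &= \nabla_{x}\left(-\frac{1}{2}( x-\mu)^{\top}\Sigma^{-1}( x-\mu) \right)\\ &= -\Sigma^{-1}( x-\mu) \end{aligned}\]

The normalising constant does not depend on \(x\), so it drops out of the gradient; the last line uses the symmetry of \(\Sigma^{-1}\).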
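The score can be checked against finite differences of the log density. This is a sketch for the bivariate case only, with the 2×2 inverse written out by hand so that it needs nothing beyond the standard library. Written as a column vector the score is \(-\Sigma^{-1}(x-\mu)\), which by symmetry of \(\Sigma\) is the transpose of the row form \(-(x-\mu)^{\top}\Sigma^{-1}\). All names and numbers here are illustrative assumptions of mine.

```python
import math


def log_density_2d(x, mu, cov):
    """log N(x; mu, cov) for a 2x2 covariance given as ((a, b), (b, c))."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form d^T cov^{-1} d via the closed-form 2x2 inverse.
    quad = (c * d0 * d0 - 2.0 * b * d0 * d1 + a * d1 * d1) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def score_2d(x, mu, cov):
    """Closed-form score -cov^{-1} (x - mu)."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    return (-(c * d0 - b * d1) / det, -(-b * d0 + a * d1) / det)


def numerical_score_2d(x, mu, cov, eps=1e-5):
    """Central finite differences of the log density, one coordinate at a time."""
    grad = []
    for i in range(2):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((log_density_2d(up, mu, cov) - log_density_2d(down, mu, cov)) / (2.0 * eps))
    return tuple(grad)


x, mu, cov = (0.4, -1.2), (1.0, 0.5), ((2.0, 0.6), (0.6, 1.0))
analytic = score_2d(x, mu, cov)
numeric = numerical_score_2d(x, mu, cov)
score_gap = max(abs(u - v) for u, v in zip(analytic, numeric))
print(analytic, numeric)
```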

### Mills ratio

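Mills’ ratio is \((1 - \Phi(x))/\phi(x)\), where \(\phi\) and \(\Phi\) are the standard normal density and CDF (written \(\psi\) and \(\Psi\) above), and is a workhorse for tail inequalities for Gaussians. See the review and extensions of classic results in Dümbgen (2010), found via Mike Spivey. Check out his extended justification for the classic bound, valid for \(x>0\):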

\[ \int_x^{\infty} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} dt \leq \int_x^{\infty} \frac{t}{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} dt = \frac{e^{-x^2/2}}{x\sqrt{2\pi}}.\]
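Dividing the bound through by the standard density says that Mills’ ratio is at most \(1/x\) for \(x>0\). A minimal numerical sketch of that, using `math.erfc` for the upper tail so that small tail probabilities are not lost to cancellation; the function names and the grid of test points are my own.

```python
import math
from statistics import NormalDist

std = NormalDist()


def upper_tail(x):
    """1 - Phi(x) for the standard normal, computed via erfc to avoid cancellation."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def mills_ratio(x):
    """Mills' ratio (1 - Phi(x)) / phi(x)."""
    return upper_tail(x) / std.pdf(x)


points = (0.5, 1.0, 2.0, 4.0, 8.0)
ratios = [mills_ratio(x) for x in points]
for x, r in zip(points, ratios):
    print(x, r, 1.0 / x)
```

The bound is loose near zero and tightens as \(x\) grows.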

## Differential representations

First, trivially, \(\phi'(x)=-\frac{e^{-\frac{x^2}{2}} x}{\sqrt{2 \pi }}.\)

### Stein’s lemma

Meckes (2009) explains Stein (1972)’s characterisation:

The normal distribution is the unique probability measure \(\mu\) for which \[ \int\left[f^{\prime}(x)-x f(x)\right] \mu(d x)=0 \] for all \(f\) for which the left-hand side exists and is finite.

This is incredibly useful in probability approximation by Gaussians where it justifies Stein’s method.
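A small numerical illustration of the characterisation, as a sketch: for the test function \(f=\sin\) (my choice) the Stein expectation vanishes under the standard normal but not under \(\mathcal{N}(0, 4)\), where Gaussian integration by parts gives \((1-\sigma^2)\,\mathbb{E}\cos X = -3e^{-2}\). The quadrature routine is a plain trapezoidal rule over a wide interval.

```python
import math


def stein_expectation(f, f_prime, sigma=1.0, n=20_000, width=12.0):
    """Trapezoidal estimate of E[f'(X) - X f(X)] for X ~ N(0, sigma^2)."""
    a = -width * sigma
    h = 2.0 * width * sigma / n
    total = 0.0
    for i in range(n + 1):
        x = a + i * h
        density = math.exp(-x * x / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * (f_prime(x) - x * f(x)) * density
    return total * h


standard = stein_expectation(math.sin, math.cos, sigma=1.0)
rescaled = stein_expectation(math.sin, math.cos, sigma=2.0)
print(standard, rescaled)
```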

### ODE representation for the univariate density

\[\begin{aligned} \sigma ^2 \phi'(x)+\phi(x) (x-\mu )&=0, \text{ i.e. } L\phi=0 \text{ with}\\ L &=\sigma^2 D+x-\mu, \end{aligned}\]

where \(D\) denotes \(d/dx\).

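With initial condition

\[\begin{aligned} \phi(0) &=\frac{e^{-\mu ^2/(2\sigma ^2)}}{\sqrt{2 \sigma^2\pi } }. \end{aligned}\]

The equation is first-order, so this single condition pins down the solution; the ODE then gives \(\phi'(0)=\mu\phi(0)/\sigma^2\), which vanishes only when \(\mu=0\).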
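As a sanity check on the ODE representation, here is a sketch that integrates \(\phi' = -(x-\mu)\phi/\sigma^2\) forward from \(x=0\) with a classical Runge-Kutta scheme and compares the result with the closed-form density. Only the value \(\phi(0)\) is needed to start, since the equation is first-order. The step count, parameter values and names are arbitrary choices of mine.

```python
import math


def density(x, mu, sigma2):
    """Closed-form Gaussian density, used for the initial value and the comparison."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def solve_density_ode(x_end, mu, sigma2, steps=2000):
    """Integrate phi' = -(x - mu) phi / sigma^2 from x = 0 to x_end with classical RK4."""

    def rhs(x, y):
        return -(x - mu) * y / sigma2

    h = x_end / steps
    x, y = 0.0, density(0.0, mu, sigma2)  # initial condition phi(0)
    for _ in range(steps):
        k1 = rhs(x, y)
        k2 = rhs(x + h / 2.0, y + h * k1 / 2.0)
        k3 = rhs(x + h / 2.0, y + h * k2 / 2.0)
        k4 = rhs(x + h, y + h * k3)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        x += h
    return y


ode_value = solve_density_ode(2.0, mu=0.5, sigma2=1.5)
exact_value = density(2.0, 0.5, 1.5)
print(ode_value, exact_value)
```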

🏗 note where I learned this.

### ODE representation for the univariate icdf

From (Steinbrecher and Shaw 2008) via Wikipedia.

Let us write \(w:=\Psi^{-1}\) to keep notation clear.

\[\begin{aligned} {\frac {d^{2}w}{dp^{2}}} &=w\left({\frac {dw}{dp}}\right)^{2}\\ \end{aligned}\]

With initial conditions

\[\begin{aligned} w\left(1/2\right)&=0,\\ w'\left(1/2\right)&={\sqrt {2\pi }}. \end{aligned}\]
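The quantile ODE and its initial conditions can be checked by finite differences against a library quantile function. This sketch uses `statistics.NormalDist().inv_cdf` as \(w\); the evaluation point and step sizes are my own choices, and the agreement is only as good as the finite-difference truncation error.

```python
import math
from statistics import NormalDist

w = NormalDist().inv_cdf  # standard normal quantile function


def quantile_ode_sides(p, h=1e-3):
    """Return (w''(p), w(p) * w'(p)^2), both estimated by central differences."""
    first = (w(p + h) - w(p - h)) / (2.0 * h)
    second = (w(p + h) - 2.0 * w(p) + w(p - h)) / h**2
    return second, w(p) * first**2


lhs, rhs = quantile_ode_sides(0.7)
value_at_half = w(0.5)
slope_at_half = (w(0.5 + 1e-4) - w(0.5 - 1e-4)) / 2e-4
print(lhs, rhs, value_at_half, slope_at_half, math.sqrt(2.0 * math.pi))
```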

### Density PDE representation as a diffusion equation

Z. I. Botev, Grotowski, and Kroese (2010) note

\[\begin{aligned} \frac{\partial}{\partial t}\phi(x;t) &=\frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(x;t)\\ \phi(x;0)&=\delta(x-\mu) \end{aligned}\]

Look, it’s the diffusion equation of the Wiener process. Surprise! If you think about this for a while you end up discovering the Feynman-Kac formulae.
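A finite-difference sketch of the diffusion equation: taking \(\phi(x;t)\) to be the Gaussian density with mean \(\mu\) and variance \(t\), the time derivative should match half the second space derivative. Step sizes and the evaluation point are arbitrary choices of mine.

```python
import math


def heat_kernel(x, t, mu=0.0):
    """phi(x; t): Gaussian density with mean mu and variance t."""
    return math.exp(-(x - mu) ** 2 / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)


def pde_residual(x, t, mu=0.0, dx=1e-3, dt=1e-5):
    """d/dt phi - (1/2) d^2/dx^2 phi, both by central differences."""
    d_t = (heat_kernel(x, t + dt, mu) - heat_kernel(x, t - dt, mu)) / (2.0 * dt)
    d_xx = (
        heat_kernel(x + dx, t, mu) - 2.0 * heat_kernel(x, t, mu) + heat_kernel(x - dx, t, mu)
    ) / dx**2
    return d_t - 0.5 * d_xx


residual = pde_residual(x=0.7, t=1.3, mu=0.2)
print(residual)
```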

## Extremes

For small \(p\), the quantile function has the asymptotic expansion \[ \Phi^{-1}(p) = -\sqrt{\ln\frac{1}{p^2} - \ln\ln\frac{1}{p^2} - \ln(2\pi)} + o(1). \]
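To get a feel for how quickly the remainder term shrinks, here is a sketch comparing the leading term of the expansion with `statistics.NormalDist().inv_cdf` at a few small probabilities. The probabilities chosen are mine; the convergence is slow, so do not expect many digits.

```python
import math
from statistics import NormalDist

exact_quantile = NormalDist().inv_cdf


def asymptotic_quantile(p):
    """Leading term of the small-p expansion of the standard normal quantile."""
    log_term = math.log(1.0 / p**2)
    return -math.sqrt(log_term - math.log(log_term) - math.log(2.0 * math.pi))


probabilities = (1e-3, 1e-6, 1e-12)
gaps = [abs(asymptotic_quantile(p) - exact_quantile(p)) for p in probabilities]
for p, gap in zip(probabilities, gaps):
    print(p, asymptotic_quantile(p), exact_quantile(p), gap)
```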

## Orthogonal basis

Polynomial basis? You want the Hermite polynomials.

## Rational function approximations

🏗

## Roughness

Univariate, reading the roughness as the integrated squared derivative:

\[\begin{aligned} \left\| \frac{d}{dx}\phi_\sigma \right\|_2^2 &= \frac{1}{4\sqrt{\pi}\sigma^3}\\ \left\| \left(\frac{d}{dx}\right)^n \phi_\sigma \right\|_2^2 &= \frac{\prod_{i=1}^{n}(2i-1)}{2^{n+1}\sqrt{\pi}\sigma^{2n+1}} \end{aligned}\]
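These roughness values can be checked by quadrature. The sketch below assumes the quantity of interest is the integrated squared derivative \(\int (\phi_\sigma^{(n)}(x))^2\,dx\) and that the product in the numerator is the double factorial \((2n-1)!!\); both readings are my assumptions. Only \(n=1\) and \(n=2\) are implemented, using the hand-derived derivatives of a zero-mean density.

```python
import math


def roughness(order, sigma, n=40_000, width=12.0):
    """Trapezoidal estimate of the integral of (d^order/dx^order phi_sigma)^2, order in {1, 2}."""

    def derivative(x):
        phi = math.exp(-x * x / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))
        if order == 1:
            return -x / sigma**2 * phi
        if order == 2:
            return (x * x / sigma**4 - 1.0 / sigma**2) * phi
        raise ValueError("only orders 1 and 2 are implemented in this sketch")

    a = -width * sigma
    h = 2.0 * width * sigma / n
    total = 0.0
    for i in range(n + 1):
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * derivative(a + i * h) ** 2
    return total * h


def roughness_closed_form(order, sigma):
    """(2n-1)!! / (2^(n+1) sqrt(pi) sigma^(2n+1)) with n = order."""
    double_factorial = math.prod(range(1, 2 * order, 2))
    return double_factorial / (2 ** (order + 1) * math.sqrt(math.pi) * sigma ** (2 * order + 1))


sigma = 0.7
gap_first = abs(roughness(1, sigma) - roughness_closed_form(1, sigma))
gap_second = abs(roughness(2, sigma) - roughness_closed_form(2, sigma))
print(gap_first, gap_second)
```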

## Entropy

The normal distribution is the least “surprising” distribution in the sense that out of all distributions with a given mean and variance the Gaussian has the maximum entropy. Or maybe that is the most surprising, depending on your definition.
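To make the maximum-entropy claim concrete, here is a minimal sketch comparing closed-form differential entropies (in nats) of three distributions sharing the same variance. The uniform and Laplace comparators are my choice, purely for illustration; they do not prove the claim, they just show the ordering in two cases.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * ln(2 pi e sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)


def uniform_entropy(sigma):
    """Uniform with variance sigma^2 has width sigma * sqrt(12) and entropy ln(width)."""
    return math.log(sigma * math.sqrt(12.0))


def laplace_entropy(sigma):
    """Laplace with variance sigma^2 has scale b = sigma / sqrt(2) and entropy 1 + ln(2b)."""
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)


sigma = 1.0
entropies = {
    "gaussian": gaussian_entropy(sigma),
    "laplace": laplace_entropy(sigma),
    "uniform": uniform_entropy(sigma),
}
print(entropies)
```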

## Multidimensional marginals and conditionals

Linear transforms of Gaussians are especially convenient. You could say that this is a definitional property of the Gaussian. Because we have learned to represent so many things by linear algebra, the pairing with Gaussians is a natural one, as made famous by Gaussian process regression in Bayesian nonparametrics.

See, e.g. these lectures, or Michael I Jordan’s backgrounders.

In practice I look up my favourite useful Gaussian identities in Petersen and Pedersen (2012) and so does everyone else I know.

## Fourier representation

The Fourier transform/Characteristic function of a Gaussian is still Gaussian.

\[\mathbb{E}\exp (i\mathbf{t}\cdot \mathbf {X}) =\exp \left( i\mathbf {t} ^{\top}{\boldsymbol {\mu }}-{\tfrac {1}{2}}\mathbf {t} ^{\top}{\boldsymbol {\Sigma }}\mathbf {t} \right).\]
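A univariate sanity check of the characteristic function, as a sketch: integrate \(e^{itx}\) against the density by the trapezoidal rule and compare with \(\exp(i\mu t - \tfrac12\sigma^2 t^2)\). Parameter values and function names are my own.

```python
import cmath
import math


def characteristic_function_quadrature(t, mu, sigma, n=20_000, width=12.0):
    """Trapezoidal estimate of E[exp(i t X)] for X ~ N(mu, sigma^2)."""
    a = mu - width * sigma
    h = 2.0 * width * sigma / n
    total = 0j
    for i in range(n + 1):
        x = a + i * h
        density = math.exp(-(x - mu) ** 2 / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * cmath.exp(1j * t * x) * density
    return total * h


def characteristic_function_closed_form(t, mu, sigma):
    """exp(i mu t - sigma^2 t^2 / 2), the scalar case of the formula above."""
    return cmath.exp(1j * mu * t - 0.5 * sigma**2 * t**2)


t, mu, sigma = 0.8, 1.0, 1.5
cf_gap = abs(
    characteristic_function_quadrature(t, mu, sigma)
    - characteristic_function_closed_form(t, mu, sigma)
)
print(cf_gap)
```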

## Transformed variates

## Metrics

Since Gaussian approximations pop up a lot in e.g. variational approximation problems, it is nice to know how to relate them in probability metrics. See distance between two Gaussians.

## What is Erf again?

This *erf*, or *error function*, is a rebranding and reparameterisation of the
standard univariate normal cdf, popular in computer science, which provides a slightly different ambiguity to the one you are used to with the “normal” density.
There are scaling factors tacked on.

\[ \operatorname{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^x e^{-t^2} \, dt \] which is to say \[\begin{aligned} \Phi(x) &={\frac {1}{2}}\left[1+\operatorname {erf} \left({\frac {x}{\sqrt {2}}}\right)\right]\\ \operatorname {erf}(x) &=2\Phi (\sqrt{2}x)-1\\ \end{aligned}\]
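The scaling factors are easy to get wrong, so here is a sketch that checks the integral definition by quadrature against `math.erf` and checks both conversion identities against `statistics.NormalDist`. Note that CPython's `NormalDist.cdf` is, as far as I know, itself computed from `erf`, so the second and third checks mostly confirm the bookkeeping rather than independent numerics.

```python
import math
from statistics import NormalDist


def erf_by_quadrature(x, n=20_000):
    """(1 / sqrt(pi)) * integral of exp(-t^2) over [-x, x], by the trapezoidal rule."""
    h = 2.0 * x / n
    total = 0.0
    for i in range(n + 1):
        t = -x + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(-t * t)
    return total * h / math.sqrt(math.pi)


x = 0.9
std = NormalDist()
gap_definition = abs(erf_by_quadrature(x) - math.erf(x))
gap_cdf = abs(std.cdf(x) - 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))))
gap_inverse = abs(math.erf(x) - (2.0 * std.cdf(math.sqrt(2.0) * x) - 1.0))
print(gap_definition, gap_cdf, gap_inverse)
```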

## Matrix Gaussian

See matrix gaussian.

## Product of densities

A workhorse of Bayesian statistics is the product of densities, and it comes out in an occasionally-useful form for Gaussians.

Let \(\mathcal{N}_{\mathbf{x}}(\mathbf{m}, \boldsymbol{\Sigma})\) denote a density of \(\mathbf{x}\), then \[ \begin{aligned} & \mathcal{N}_{\mathbf{x}}\left(\mathbf{m}_1, \boldsymbol{\Sigma}_1\right) \cdot \mathcal{N}_{\mathbf{x}}\left(\mathbf{m}_2, \boldsymbol{\Sigma}_2\right)\propto \mathcal{N}_{\mathbf{x}}\left(\mathbf{m}_c, \boldsymbol{\Sigma}_c\right) \\ & \mathbf{m}_c=\left(\boldsymbol{\Sigma}_1^{-1}+\boldsymbol{\Sigma}_2^{-1}\right)^{-1}\left(\boldsymbol{\Sigma}_1^{-1} \mathbf{m}_1+\boldsymbol{\Sigma}_2^{-1} \mathbf{m}_2\right) \\ & \boldsymbol{\Sigma}_c=\left(\boldsymbol{\Sigma}_1^{-1}+\boldsymbol{\Sigma}_2^{-1}\right)^{-1} \end{aligned} \]
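In the scalar case the formulas reduce to precision-weighted averaging, which makes for an easy check. This sketch verifies that the ratio of the product to \(\mathcal{N}_x(m_c, \sigma_c^2)\) does not depend on \(x\). It also compares that ratio with \(\mathcal{N}_{m_1}(m_2, \sigma_1^2+\sigma_2^2)\), a standard expression for the proportionality constant that the text above does not state. Names and numbers are illustrative.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def product_parameters(m1, v1, m2, v2):
    """Mean and variance of the Gaussian proportional to N(m1, v1) * N(m2, v2)."""
    precision = 1.0 / v1 + 1.0 / v2  # precisions add
    var_c = 1.0 / precision
    mean_c = var_c * (m1 / v1 + m2 / v2)  # precision-weighted mean
    return mean_c, var_c


m1, v1, m2, v2 = 0.0, 1.0, 2.0, 0.5
mean_c, var_c = product_parameters(m1, v1, m2, v2)

# The ratio product / N(x; mean_c, var_c) should be the same at every x.
ratios = [
    normal_pdf(x, m1, v1) * normal_pdf(x, m2, v2) / normal_pdf(x, mean_c, var_c)
    for x in (-1.0, 0.3, 1.7)
]
constant = normal_pdf(m1, m2, v1 + v2)
print(mean_c, var_c, ratios, constant)
```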

## Annealed

TBC. Call the \(\alpha\)-annealing of a density \(f\) the density \(f^\alpha\).

## References

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 79 (1): 125–48.

*The Annals of Statistics* 38 (5): 2916–57.

*arXiv:1012.2063 [Math, Stat]*, December.

*The Michigan Mathematical Journal* 31 (2): 231–40.

*Matrix Variate Distributions*. Chapman & Hall/CRC Monographs and Surveys in Pure and Applied Mathematics 104. Boca Raton: Chapman and Hall/CRC.

*Matrix differential calculus with applications in statistics and econometrics*. 3rd ed. Wiley series in probability and statistics. Hoboken (N.J.): Wiley.

*Heliyon* 5 (2): e01136.

*High Dimensional Probability V: The Luminy Volume*, 153–78. Beachwood, Ohio, USA: Institute of Mathematical Statistics.

*Old and new matrix algebra useful for statistics*.

*Journal of Mathematics Research* 2 (4): p47.

*Electronic Journal of Applied Statistical Analysis* 5 (1).

*Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory*, January, 583–602.

*Approximate Computation of Expectations*. Vol. 7. IMS.

*European Journal of Applied Mathematics* 19 (2): 87–112.

*Mathematics of Computation* 22 (101): 144–58.

*Journal of the Royal Statistical Society. Series C (Applied Statistics)* 37 (3): 477–84.
