Bell curves
Many facts about the useful, boring, ubiquitous Gaussian. Djalil Chafaï lists Three reasons for Gaussians, emphasising more abstract, not-necessarily generative reasons.
- Gaussians as isotropic distributions — the Gaussian is the only distribution that can simultaneously have independent coordinates and be isotropic (rotationally invariant).
- Entropy maximising (the Gaussian has the highest entropy of any distribution with fixed variance and finite entropy)
- The only stable distribution with finite variance
Many other things give rise to Gaussians: sampling distributions for test statistics, bootstrap samples, low-dimensional projections, anything with the right Stein-type symmetries… There are also many post hoc rationalisations that reach for the Gaussian in the hope that it is close enough to the real distribution: assuming something is a Gaussian process because Gaussian processes are tractable, seeking a noise distribution that justifies quadratic loss, using Brownian motion in stochastic calculus because it comes out neatly, and so on.
Density, CDF
The standard (univariate) Gaussian pdf is \[ \psi:x\mapsto \frac{1}{\sqrt{2\pi}}\text{exp}\left(-\frac{x^2}{2}\right). \] Typically we allow a scale-location parameterised version \[ \phi(x; \mu,\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}} \] We call the CDF \[ \Psi:x\mapsto \int_{-\infty}^x\psi(t) dt. \] In the multivariate case, where the covariance \(\Sigma\) is strictly positive definite, we can write a density of the general normal distribution over \(\mathbb{R}^k\) as \[ \psi({x}; \mu, \Sigma) = (2\pi )^{-{\frac {k}{2}}}\det(\Sigma)^{-\frac{1}{2}}\,\exp ({-\frac{1}{2}( x-\mu)^{\top}\Sigma^{-1}( x-\mu)}) \] If a random variable \(Y\) has a Gaussian distribution with parameters \(\mu, \Sigma\), we write \[Y \sim \mathcal{N}(\mu, \Sigma)\]
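Here is a minimal numerical sanity check of that multivariate density formula, assuming numpy/scipy are available; the particular \(\mu\), \(\Sigma\) and \(x\) below are arbitrary illustration values.

```python
# Compare the explicit multivariate Gaussian density formula with
# scipy.stats.multivariate_normal; mu, Sigma, x are arbitrary examples.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, -1.5])

k = len(mu)
diff = x - mu
manual = (2 * np.pi) ** (-k / 2) * np.linalg.det(Sigma) ** (-0.5) \
    * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(manual, multivariate_normal(mu, Sigma).pdf(x))  # should agree
```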
Taylor expansion of \(e^{-x^2/2}\): \[ e^{-x^2/2} = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k}}{2^k\,k!}. \]
Ortiz et al. summarise Gaussian parameterisations.
Score
\[\begin{aligned} \nabla_{x}\log\psi({x}; \mu, \Sigma) &= \nabla_{x}\left(-\frac{1}{2}( x-\mu)^{\top}\Sigma^{-1}( x-\mu) \right)\\ &= -\Sigma^{-1}( x-\mu) \end{aligned}\] using the symmetry of \(\Sigma\).
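A sketch checking that score expression against a central finite-difference gradient of the log-density (scipy assumed; \(\mu\), \(\Sigma\), \(x\) are arbitrary):

```python
# Closed-form score -Sigma^{-1}(x - mu) vs a finite-difference gradient of
# log psi(x; mu, Sigma); all numbers are arbitrary illustrations.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, -1.5])
rv = multivariate_normal(mu, Sigma)

closed_form = -np.linalg.solve(Sigma, x - mu)

eps = 1e-6
numeric = np.array([
    (rv.logpdf(x + eps * e) - rv.logpdf(x - eps * e)) / (2 * eps)
    for e in np.eye(len(x))
])
print(closed_form, numeric)  # should agree to ~1e-6
```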
Mills ratio
Mills’ ratio is \((1 - \Phi(x))/\phi(x)\) and is a workhorse for tail inequalities for Gaussians. See the review and extensions of classic results in Dümbgen (2010), found via Mike Spivey. Check out his extended justification for the classic bound
\[ \int_x^{\infty} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} dt \leq \int_x^{\infty} \frac{t}{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} dt = \frac{e^{-x^2/2}}{x\sqrt{2\pi}}.\]
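And a quick numerical check of that bound, and of Mills’ ratio itself, via scipy (the evaluation points are arbitrary):

```python
# The Gaussian upper tail should sit below exp(-x^2/2) / (x sqrt(2 pi)) for
# x > 0; x values are arbitrary illustration points.
import numpy as np
from scipy.stats import norm

x = np.array([0.5, 1.0, 2.0, 4.0])
tail = norm.sf(x)                                    # 1 - Phi(x)
bound = np.exp(-x ** 2 / 2) / (x * np.sqrt(2 * np.pi))
mills = tail / norm.pdf(x)                           # Mills' ratio
print(tail <= bound)   # expect all True
print(mills)
```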
Differential representations
First, trivially, \(\phi'(x)=-\frac{e^{-\frac{x^2}{2}} x}{\sqrt{2 \pi }}.\)
Stein’s lemma
Meckes (2009) explains Stein (1972)’s characterisation:
The standard normal distribution is the unique probability measure \(\mu\) for which \[ \int\left[f^{\prime}(x)-x f(x)\right] \mu(d x)=0 \] for all \(f\) for which the left-hand side exists and is finite.
This is incredibly useful in probability approximation by Gaussians where it justifies Stein’s method.
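A Monte Carlo sketch of the identity, using \(f(x)=\sin(x)\) as an arbitrary smooth test function:

```python
# E[f'(Z) - Z f(Z)] should vanish for standard normal Z; the residual here is
# only Monte Carlo error, O(1/sqrt(n)).
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(10 ** 6)

f, f_prime = np.sin, np.cos
print(np.mean(f_prime(z) - z * f(z)))  # close to 0
```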
ODE representation for the univariate density
\[\begin{aligned} \sigma ^2 \phi'(x)+(x-\mu )\phi(x)&=0, \text{ i.e. } L\phi = 0, \text{ where}\\ L &=\sigma^2 D+(x-\mu). \end{aligned}\]
With initial condition (the ODE is first order, so one condition suffices)
\[ \phi(0) =\frac{e^{-\mu ^2/(2\sigma ^2)}}{\sqrt{2 \pi\sigma^2 } }. \]
🏗 note where I learned this.
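In any case it is easy to check numerically; a sketch with scipy’s `solve_ivp` (arbitrary \(\mu\), \(\sigma\)):

```python
# Solve sigma^2 phi' + (x - mu) phi = 0 from x = 0 and compare with the
# closed-form density; mu and sigma are arbitrary.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import norm

mu, sigma = 1.0, 1.5
phi0 = norm.pdf(0.0, loc=mu, scale=sigma)

sol = solve_ivp(
    lambda x, phi: -(x - mu) * phi / sigma ** 2,
    t_span=(0.0, 5.0),
    y0=[phi0],
    dense_output=True,
    rtol=1e-9, atol=1e-12,
)
xs = np.linspace(0.0, 5.0, 5)
print(sol.sol(xs)[0])
print(norm.pdf(xs, loc=mu, scale=sigma))  # should match closely
```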
ODE representation for the univariate icdf
From (Steinbrecher and Shaw 2008) via Wikipedia.
Let us write \(w:=\Psi^{-1}\) to keep notation clear.
\[\begin{aligned} {\frac {d^{2}w}{dp^{2}}} &=w\left({\frac {dw}{dp}}\right)^{2}\\ \end{aligned}\]
With initial conditions
\[\begin{aligned} w\left(1/2\right)&=0,\\ w'\left(1/2\right)&={\sqrt {2\pi }}. \end{aligned}\]
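This one is also easy to check numerically (a sketch with scipy; the evaluation points are arbitrary):

```python
# Integrate w'' = w (w')^2 from p = 1/2 with the stated initial conditions and
# compare against scipy's norm.ppf.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import norm

def rhs(p, y):
    w, wp = y
    return [wp, w * wp ** 2]

sol = solve_ivp(rhs, t_span=(0.5, 0.99), y0=[0.0, np.sqrt(2 * np.pi)],
                dense_output=True, rtol=1e-10, atol=1e-12)
ps = np.array([0.6, 0.75, 0.9, 0.99])
print(sol.sol(ps)[0])
print(norm.ppf(ps))  # should agree to several decimal places
```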
Density PDE representation as a diffusion equation
Z. I. Botev, Grotowski, and Kroese (2010) note
\[\begin{aligned} \frac{\partial}{\partial t}\phi(x;t) &=\frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(x;t)\\ \phi(x;0)&=\delta(x-\mu) \end{aligned}\]
Look, it’s the diffusion equation of the Wiener process. Surprise! If you think about this for a while you end up discovering the Feynman–Kac formula.
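A finite-difference sketch of the PDE, taking \(\phi(x;t)\) to be the \(\mathcal{N}(\mu, t)\) density (all numbers arbitrary):

```python
# Check d phi / dt = (1/2) d^2 phi / dx^2 at one (x, t) point by central
# differences; mu, t, x, h are arbitrary choices.
import numpy as np
from scipy.stats import norm

mu, t, x, h = 0.5, 2.0, 1.3, 1e-4

def phi(x, t):
    return norm.pdf(x, loc=mu, scale=np.sqrt(t))

dt = (phi(x, t + h) - phi(x, t - h)) / (2 * h)
dxx = (phi(x + h, t) - 2 * phi(x, t) + phi(x - h, t)) / h ** 2
print(dt, 0.5 * dxx)  # should agree to several decimals
```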
Extremes
For small \(p\), the quantile function has the asymptotic expansion \[ \Phi^{-1}(p) = -\sqrt{\ln\frac{1}{p^2} - \ln\ln\frac{1}{p^2} - \ln(2\pi)} + o(1). \]
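A quick look at how fast that \(o(1)\) term shrinks (scipy assumed; the chosen \(p\) values are arbitrary):

```python
# Compare the small-p asymptotic with norm.ppf; the gap shrinks as p -> 0.
import numpy as np
from scipy.stats import norm

p = np.array([1e-2, 1e-4, 1e-8, 1e-16])
asymptotic = -np.sqrt(np.log(1 / p ** 2) - np.log(np.log(1 / p ** 2))
                      - np.log(2 * np.pi))
print(asymptotic)
print(norm.ppf(p))
```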
Orthogonal basis
Polynomial basis? You want the Hermite polynomials.
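Specifically the probabilists’ Hermite polynomials \(\mathrm{He}_n\), which are orthogonal with respect to the standard normal density: \(\int \mathrm{He}_m(x)\mathrm{He}_n(x)\psi(x)\,dx = n!\,\delta_{mn}\). A sketch using numpy’s `hermite_e` module:

```python
# Orthogonality of He_m, He_n under the standard Gaussian weight.
import math
import numpy as np
from numpy.polynomial import hermite_e as H
from scipy.integrate import quad
from scipy.stats import norm

def weighted_inner(m, n):
    cm = np.eye(max(m, n) + 1)[m]   # coefficient vector selecting He_m
    cn = np.eye(max(m, n) + 1)[n]
    integrand = lambda x: H.hermeval(x, cm) * H.hermeval(x, cn) * norm.pdf(x)
    return quad(integrand, -np.inf, np.inf)[0]

print(weighted_inner(2, 3))                      # ~0: orthogonal
print(weighted_inner(3, 3), math.factorial(3))   # both ~6
```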
Rational function approximations
🏗
Roughness
Univariate:
\[\begin{aligned} \left\| \frac{d}{dx}\phi_\sigma \right\|_2^2 &= \frac{1}{4\sqrt{\pi}\sigma^3}\\ \left\| \left(\frac{d}{dx}\right)^n \phi_\sigma \right\|_2^2 &= \frac{(2n-1)!!}{2^{n+1}\sqrt{\pi}\sigma^{2n+1}} \end{aligned}\]
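These are easy to check against quadrature (a sketch using the closed-form derivatives of the density; \(\sigma\) is arbitrary):

```python
# Squared L2 norms of the first two derivatives of the N(0, sigma^2) density,
# numerical quadrature vs the formulas above.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma = 1.7
dphi = lambda x: -(x / sigma ** 2) * norm.pdf(x, scale=sigma)
d2phi = lambda x: (x ** 2 / sigma ** 4 - 1 / sigma ** 2) * norm.pdf(x, scale=sigma)

n1, _ = quad(lambda x: dphi(x) ** 2, -np.inf, np.inf)
n2, _ = quad(lambda x: d2phi(x) ** 2, -np.inf, np.inf)
print(n1, 1 / (4 * np.sqrt(np.pi) * sigma ** 3))
print(n2, 3 / (8 * np.sqrt(np.pi) * sigma ** 5))
```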
Entropy
The normal distribution is the least “surprising” distribution in the sense that out of all distributions with a given mean and variance the Gaussian has the maximum entropy. Or maybe that is the most surprising, depending on your definition.
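The maximising value itself is \(\tfrac{1}{2}\log(2\pi e\sigma^2)\) nats for \(\mathcal{N}(\mu,\sigma^2)\); a quick check against scipy (arbitrary \(\sigma\)):

```python
# Differential entropy of N(mu, sigma^2): closed form vs scipy.
import numpy as np
from scipy.stats import norm

sigma = 2.5
print(0.5 * np.log(2 * np.pi * np.e * sigma ** 2))
print(norm.entropy(scale=sigma))  # should match
```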
Multidimensional marginals and conditionals
Linear transforms of Gaussians are especially convenient. You could say that this is a definitional property of the Gaussian. Because we have learned to represent so many things by linear algebra, the pairing with Gaussians is a natural one, as made famous by Gaussian process regression in Bayesian nonparametrics.
See, e.g., these lectures, or Michael I. Jordan’s backgrounders.
In practice I look up my favourite useful Gaussian identities in Petersen and Pedersen (2012) and so does everyone else I know.
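For example, the conditioning identity (of the kind catalogued in Petersen and Pedersen) says that for a joint Gaussian, \(x_1 \mid x_2 = a\) is again Gaussian with Schur-complement mean and covariance; a bivariate numerical sketch (all numbers arbitrary):

```python
# Conditional of a bivariate Gaussian: p(x1, a) / p(a) should equal the
# N(cond_mean, cond_var) density pointwise.
import numpy as np
from scipy.stats import multivariate_normal, norm

m = np.array([1.0, -1.0])
S = np.array([[2.0, 0.8],
              [0.8, 1.5]])
a = 0.3   # conditioning value for x2

cond_mean = m[0] + S[0, 1] / S[1, 1] * (a - m[1])
cond_var = S[0, 0] - S[0, 1] ** 2 / S[1, 1]

x1 = 0.7
joint = multivariate_normal(m, S).pdf([x1, a])
marginal = norm.pdf(a, loc=m[1], scale=np.sqrt(S[1, 1]))
print(joint / marginal)
print(norm.pdf(x1, loc=cond_mean, scale=np.sqrt(cond_var)))  # should match
```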
Fourier representation
The Fourier transform/Characteristic function of a Gaussian is still Gaussian.
\[\mathbb{E}\exp (i\mathbf{t}\cdot \mathbf {X}) =\exp \left( i\mathbf {t} ^{\top}{\boldsymbol {\mu }}-{\tfrac {1}{2}}\mathbf {t} ^{\top}{\boldsymbol {\Sigma }}\mathbf {t} \right).\]
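A Monte Carlo sketch of that formula (arbitrary \(\mu\), \(\Sigma\), \(\mathbf{t}\)):

```python
# Empirical mean of exp(i t . X) vs the closed-form characteristic function.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
t = np.array([0.3, -0.7])

X = rng.multivariate_normal(mu, Sigma, size=10 ** 6)
empirical = np.mean(np.exp(1j * X @ t))
closed_form = np.exp(1j * t @ mu - 0.5 * t @ Sigma @ t)
print(empirical, closed_form)  # agree up to Monte Carlo error
```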
Transformed variates
Metrics
Since Gaussian approximations pop up a lot in e.g. variational approximation problems, it is nice to know how to relate them in probability metrics. See distance between two Gaussians.
What is Erf again?
This erf, or error function, is a rebranding and reparameterisation of the standard univariate normal CDF, popular in computer science; it provides a slightly different ambiguity from the one you are used to with the “normal” density. There are scaling factors tacked on.
\[ \operatorname{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^x e^{-t^2} \, dt \] which is to say \[\begin{aligned} \Phi(x) &={\frac {1}{2}}\left[1+\operatorname {erf} \left({\frac {x}{\sqrt {2}}}\right)\right]\\ \operatorname {erf}(x) &=2\Phi (\sqrt{2}x)-1\\ \end{aligned}\]
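Quick check of those conversions (scipy assumed; the \(x\) values are arbitrary):

```python
# Phi <-> erf conversions.
import numpy as np
from scipy.stats import norm
from scipy.special import erf

x = np.array([-1.5, 0.0, 0.7, 2.3])
print(norm.cdf(x))
print(0.5 * (1 + erf(x / np.sqrt(2))))            # Phi via erf
print(erf(x), 2 * norm.cdf(np.sqrt(2) * x) - 1)   # erf via Phi
```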
Matrix Gaussian
See matrix gaussian.
Product of densities
A workhorse of Bayesian statistics is the product of densities, and it comes out in an occasionally-useful form for Gaussians.
Let \(\mathcal{N}_{\mathbf{x}}(\mathbf{m}, \boldsymbol{\Sigma})\) denote a density of \(\mathbf{x}\), then \[ \begin{aligned} & \mathcal{N}_{\mathbf{x}}\left(\mathbf{m}_1, \boldsymbol{\Sigma}_1\right) \cdot \mathcal{N}_{\mathbf{x}}\left(\mathbf{m}_2, \boldsymbol{\Sigma}_2\right)\propto \mathcal{N}_{\mathbf{x}}\left(\mathbf{m}_c, \boldsymbol{\Sigma}_c\right) \\ & \mathbf{m}_c=\left(\boldsymbol{\Sigma}_1^{-1}+\boldsymbol{\Sigma}_2^{-1}\right)^{-1}\left(\boldsymbol{\Sigma}_1^{-1} \mathbf{m}_1+\boldsymbol{\Sigma}_2^{-1} \mathbf{m}_2\right) \\ & \boldsymbol{\Sigma}_c=\left(\boldsymbol{\Sigma}_1^{-1}+\boldsymbol{\Sigma}_2^{-1}\right)^{-1} \end{aligned} \]
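A pointwise numerical sketch of that proportionality (all numbers arbitrary); note the ratio is constant in \(\mathbf{x}\) but not equal to 1:

```python
# Product of two Gaussian densities vs the combined Gaussian: the ratio should
# be the same at every x (proportionality, not equality).
import numpy as np
from scipy.stats import multivariate_normal as mvn

m1, S1 = np.array([1.0, 0.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
m2, S2 = np.array([-1.0, 2.0]), np.array([[1.0, -0.2], [-0.2, 1.5]])

P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)
Sc = np.linalg.inv(P1 + P2)
mc = Sc @ (P1 @ m1 + P2 @ m2)

for x in [np.array([0.0, 0.0]), np.array([1.3, -0.7]), np.array([-2.0, 2.5])]:
    ratio = mvn(m1, S1).pdf(x) * mvn(m2, S2).pdf(x) / mvn(mc, Sc).pdf(x)
    print(ratio)  # constant across x
```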
Annealed
TBC. Call the \(\alpha\)-annealing of a density \(f\) the density \(f^\alpha\).