Exponential families
April 19, 2016 — July 1, 2024
Assumed audience:
Data scientists who must pretend they can remember statistics
Exponential families! The secret magic at the heart of traditional statistics.
Exponential families are probability distributions that just work, in the sense that the things we would hope we can do with them, we can. Informally, this is because a lot of the stuff we do in statistics is about multiplying probabilities, and exponential families are distributions that capture “easy-to-multiply” probabilities. Thus these are the distributions we are taught to handle in statistics classes, and which lead us to undue optimism about statistics more generally, all of which falls apart later. Often, though, we can approximate intractable families by exponential ones or cunning combinations thereof, e.g. in variational inference, so this is not a complete waste of time.
1 Background
Michael I. Jordan's lecture notes, why not?
2 Natural exponential families
a.k.a. NEFs. The simplest case. Suppose that \(\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^{p}.\) Then a natural exponential family of order \(p\) has a density or mass function of the form \[ f_{X}(\mathbf{x}; {\boldsymbol{\theta}})=h(\mathbf{x}) e^{{\boldsymbol{\theta}}^{\rm{T}}\mathbf{x} -A({\boldsymbol{\theta}}) } \] where the parameter \(\boldsymbol{\theta}\in \mathbb{R}^{p}.\) That is, it is the specialisation of the general exponential family (next section) in which the natural statistic is the observation itself (or, for i.i.d. samples, their sum).
Important members of this sub-family: Gamma with known shape, Gaussian with known variance, negative binomial with known \(r\), Poisson, and binomial with known number of trials.
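To make this concrete, here is a sketch (only numpy/scipy assumed; the function name is mine) expressing the Poisson pmf in natural-exponential-family form, with \(\theta=\log\lambda\), \(h(x)=1/x!\), and \(A(\theta)=e^{\theta}\):

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

# Poisson as a natural exponential family:
#   f(x; theta) = h(x) * exp(theta * x - A(theta))
# with natural parameter theta = log(lambda),
# base measure h(x) = 1/x!, and log-partition A(theta) = exp(theta).

def poisson_nef_logpmf(x, theta):
    log_h = -gammaln(x + 1)   # log h(x) = -log(x!)
    A = np.exp(theta)         # log-partition function
    return log_h + theta * x - A

lam = 3.7
x = np.arange(20)
assert np.allclose(
    poisson_nef_logpmf(x, np.log(lam)),
    stats.poisson(lam).logpmf(x),
)
```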
I mention this family first because it is a good intuition pump. The only problem that it has usefully solved for me so far is Gaussian Belief Propagation. That is, however, a very important case.
3 (Full-blown) exponential families
More commonly we consider the general exponential family, which allows the statistics and the parameters to interact via some \(\mathbb{R}^{p}\to\mathbb{R}^{q}\) statistic map \(T\) and some parameter map \(\boldsymbol{\theta}\mapsto\boldsymbol{\eta}(\boldsymbol{\theta})\in \mathbb{R}^{q}\). \[ f_{X}\left(\mathbf{x}; {\boldsymbol{\theta}} \right) =h(\mathbf{x}) e^{ {\boldsymbol{\eta}}({\boldsymbol{\theta}})\cdot \mathbf{T}(\mathbf{x}) -A({\boldsymbol{\theta}}) }. \] In fact, since the parameter \(\boldsymbol{\theta}\) only ever enters via \(\boldsymbol{\eta}(\boldsymbol{\theta})\), hereafter we reparameterise and treat the vector \(\boldsymbol{\eta}\) itself as the parameter, simplifying to \[ f_{X}\left(\mathbf{x}; {\boldsymbol{\eta}} \right) =h(\mathbf{x}) e^{ {\boldsymbol{\eta}}\cdot \mathbf{T}(\mathbf{x}) -A({\boldsymbol{\eta}})}. \] We call \(\boldsymbol{\eta}\) the natural or canonical parameter, \(\mathbf{T}\) the sufficient statistic, and \(A\) the log-partition function.
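As a sanity check, here is a sketch (function name mine; only numpy/scipy assumed) writing the Gaussian with unknown mean and variance in this form, with \(\mathbf{T}(x)=(x,x^{2})\), \(\boldsymbol{\eta}=(\mu/\sigma^{2},\,-1/(2\sigma^{2}))\), \(h(x)=1/\sqrt{2\pi}\), and \(A(\boldsymbol{\eta})=-\eta_{1}^{2}/(4\eta_{2})-\tfrac{1}{2}\log(-2\eta_{2})\):

```python
import numpy as np
from scipy import stats

# Gaussian with unknown mean and variance as a two-parameter
# exponential family:
#   T(x)   = (x, x^2)
#   eta    = (mu / sigma^2, -1 / (2 sigma^2))
#   h(x)   = 1 / sqrt(2 pi)
#   A(eta) = -eta1^2 / (4 eta2) - (1/2) log(-2 eta2)

def gaussian_ef_logpdf(x, eta1, eta2):
    log_h = -0.5 * np.log(2 * np.pi)
    A = -eta1 ** 2 / (4 * eta2) - 0.5 * np.log(-2 * eta2)
    return log_h + eta1 * x + eta2 * x ** 2 - A

mu, sigma = 1.5, 0.8
eta1, eta2 = mu / sigma ** 2, -1 / (2 * sigma ** 2)
x = np.linspace(-3, 5, 7)
assert np.allclose(gaussian_ef_logpdf(x, eta1, eta2),
                   stats.norm(mu, sigma).logpdf(x))
```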
4 Cumulant generating function
TBC.
For the natural exponential families, \(T\) is the identity map, and the mean vector and covariance matrix are \[ \operatorname{E}[X]=\nabla A({\boldsymbol{\eta}})\text{ and }\operatorname{Cov}[X]=\nabla \nabla^{\top}A({\boldsymbol{\eta}}) \] where \(\nabla\) is the gradient and \(\nabla \nabla^{\top}\) is the Hessian matrix. Both identities follow by differentiating the cumulant-generating function (next section) at \(u=0\); for a general exponential family the same formulas give the mean and covariance of the sufficient statistic \(\mathbf{T}(X)\).
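A quick numerical check of these identities for the Poisson, whose log-partition function in the natural parameterisation is \(A(\eta)=e^{\eta}\); finite differences stand in for the gradient and Hessian (numpy only):

```python
import numpy as np

# Check E[X] = A'(eta) and Var[X] = A''(eta) for the Poisson,
# whose log-partition function is A(eta) = exp(eta), eta = log(lambda).
# For the Poisson both the mean and the variance equal lambda.
A = np.exp
lam = 3.7
eta, h = np.log(lam), 1e-4

mean = (A(eta + h) - A(eta - h)) / (2 * h)            # ~ A'(eta)
var = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # ~ A''(eta)
assert np.isclose(mean, lam) and np.isclose(var, lam)
```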
5 Natural parameters and sufficient statistics
One of the neat things about the exponential families is that the log-partition function, natural statistics, and natural parameters are all informative about one another.
The cumulant-generating function of the sufficient statistic is simply \(K(u\mid\boldsymbol{\eta})=A(\boldsymbol{\eta}+u)-A(\boldsymbol{\eta})\), so all the cumulants of \(\mathbf{T}(X)\) come from derivatives of the log-partition function.
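The derivation is one line, because the moment-generating function of \(\mathbf{T}(X)\) is itself an exponential-family integral: \[ \operatorname{E}\left[e^{u\cdot \mathbf{T}(X)}\right] =\int h(\mathbf{x})\, e^{({\boldsymbol{\eta}}+u)\cdot \mathbf{T}(\mathbf{x})-A({\boldsymbol{\eta}})}\, d\mathbf{x} =e^{A({\boldsymbol{\eta}}+u)-A({\boldsymbol{\eta}})}, \] since \(\int h(\mathbf{x})e^{({\boldsymbol{\eta}}+u)\cdot\mathbf{T}(\mathbf{x})}d\mathbf{x}=e^{A({\boldsymbol{\eta}}+u)}\) by the definition of the log-partition function; taking logs gives \(K\).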
6 Natural exponential families with quadratic variance functions
A special case with nice properties (Morris 1982, 1983; Morris and Lock 2009).
Morris (1982):

The normal, Poisson, gamma, binomial, and negative binomial distributions are univariate natural exponential families with quadratic variance functions (the variance is at most a quadratic function of the mean). Only one other such family exists. Much theory is unified for these six natural exponential families by appeal to their quadratic variance property, including infinite divisibility, cumulants, orthogonal polynomials, large deviations, and limits in distribution.

The sixth family, for the record, is the one generated by the generalized hyperbolic secant distribution (the NEF-GHS). For intuition, the variance functions here are e.g. \(V(\mu)=\mu\) for the Poisson, \(V(\mu)=\mu-\mu^{2}/n\) for the binomial, and \(V(\mu)=\mu^{2}/k\) for the gamma with shape \(k\): all at most quadratic in the mean \(\mu\).
7 Conjugate priors
A useful feature of exponential families is that they have conjugate priors, which means that the posterior distribution is in the same family as the prior, and moreover, there is a simple formula for updating the parameters. This is deliciously easy, and also misleads one into thinking that Bayes inference is much easier than it actually is in the general case. See conjugate priors.
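For instance, a Gamma prior on a Poisson rate updates by simple counting. A minimal sketch, assuming only numpy/scipy (the grid-based check is mine, not a library routine):

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Gamma(a, b) prior (shape a, rate b) on a Poisson rate is conjugate:
# after observing x_1, ..., x_n the posterior is Gamma(a + sum x_i, b + n).
rng = np.random.default_rng(42)
a, b = 2.0, 1.0
x = rng.poisson(lam=3.7, size=50)
a_post, b_post = a + x.sum(), b + len(x)

# Sanity check against a brute-force posterior computed on a grid.
lam = np.linspace(1e-3, 15, 4000)
log_post = (stats.gamma(a, scale=1 / b).logpdf(lam)
            + stats.poisson(lam[:, None]).logpmf(x).sum(axis=1))
post = np.exp(log_post - log_post.max())
post /= trapezoid(post, lam)  # normalise numerically
assert np.allclose(post, stats.gamma(a_post, scale=1 / b_post).pdf(lam),
                   atol=1e-4)
```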
8 PCA
Classical PCA is implicitly a model for Gaussian data. I gather there is some sense in which it can be generalised to all exponential families as Exponential Family PCA (Collins, Dasgupta, and Schapire 2001; Li and Tao 2013; Liu, Dobriban, and Singer 2017; Mohamed, Ghahramani, and Heller 2008).
9 For random graphs
Exponential random graph models. TBD
10 In graphical models
11 Curved exponential families
A generalisation I occasionally see is that of curved exponential families. I do not know how these work or if they have enough features to benefit me.
12 Squared Neural Families
Another generalisation. See squared neural families.
13 Tempered
Call \(f^\alpha\) the \(\alpha\)-tempering of a density \(f\), for \(\alpha\in(0,\infty)\); in fact we include a normalising constant \(Z\) so that it is still a valid density. Tempering ends up being handy for many purposes, including understanding conjugate priors.
\[ \begin{aligned} f^\alpha\left(\mathbf {x} ; {\boldsymbol{\eta }} \right) &=\frac{1}{Z(\boldsymbol{\eta }, \alpha)}\left(h(\mathbf {x} ) e^{ {\boldsymbol{\eta }}\cdot \mathbf {T} (\mathbf {x} ) -A({\boldsymbol{\eta}}) }\right)^{\alpha}\\ &=\frac{h^{\alpha}(\mathbf {x} ) e^{ \alpha{\boldsymbol{\eta }}\cdot \mathbf {T} (\mathbf {x} ) -\alpha A({\boldsymbol{\eta}}) }}{Z(\boldsymbol{\eta }, \alpha)} \end{aligned} \] Can this be normalised in general? If we require that \[ \begin{aligned} \int f^\alpha\left(\mathbf {x} ; {\boldsymbol{\eta }} \right) d\mathbf{x}&=1 \end{aligned} \] then we must have \[ \begin{aligned} 1 &= \int\frac{e^{\alpha {\boldsymbol{\eta }}\cdot \mathbf {T} (\mathbf {x} ) + \alpha \log h(\mathbf {x} ) }}{Z(\boldsymbol{\eta }, \alpha)e^{ \alpha A({\boldsymbol{\eta}})}} d\mathbf{x}\\ Z(\boldsymbol{\eta }, \alpha) &= e^{ -\alpha A({\boldsymbol{\eta}})}\int e^{\alpha {\boldsymbol{\eta }}\cdot \mathbf {T} (\mathbf {x} ) + \alpha \log h(\mathbf {x} ) }d\mathbf{x} \end{aligned} \]
AFAICT we cannot say more about the normalising constant without knowing more about the form of \(h\), the base measure; sometimes it simplifies, but that depends upon the specific exponential family we are in. (When \(h\) is constant, the tempered density stays in the same family, with natural parameter \(\alpha\boldsymbol{\eta}\).)
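For instance, the Gaussian has constant \(h\), so tempering merely rescales the variance: \(\alpha\)-tempering \(\mathcal{N}(\mu,\sigma^{2})\) gives \(\mathcal{N}(\mu,\sigma^{2}/\alpha)\). A numerical check (numpy/scipy only):

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# For the Gaussian, h(x) is constant, so tempering stays in the family:
# N(mu, sigma^2)^alpha renormalises to N(mu, sigma^2 / alpha).
mu, sigma, alpha = 1.5, 0.8, 3.0

x = np.linspace(-4, 7, 4000)
tempered = stats.norm(mu, sigma).pdf(x) ** alpha
tempered /= trapezoid(tempered, x)  # brute-force 1/Z on the grid

assert np.allclose(tempered,
                   stats.norm(mu, sigma / np.sqrt(alpha)).pdf(x),
                   atol=1e-6)
```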