# Conjugate priors

June 26, 2024 — July 2, 2024

functional analysis
probability
statistics

### Assumed audience:

Data scientists who must pretend they can remember statistics

A conjugate prior is one that is closed under sampling given its matched likelihood function. I occasionally see people talk about this as if it usefully applies to non-exponential family likelihoods, but I am familiar with it only in the case of exponential families, so we restrict ourselves to that case here.

The idea seems to have arisen in the 60s and been re-interpreted in the 70s. A pragmatic intro is Fink (1997). Robert (2007) chapter 3 is gentler.

Exponential families have tractable conjugate priors, which means that the posterior distribution is in the same family as the prior, and moreover, there is a simple formula for updating the parameters. This is deliciously easy, and also misleads one into thinking that Bayes inference is much easier than it actually is in the general case, because it is so easy in this one.

We are going to observe lots of i.i.d. realisations of some variate $$X\sim p(x|\theta)$$ and would like a consistent procedure for updating our beliefs about $$\theta$$.

## 1 Exponential family likelihood

Our observation $$X$$ is assumed to arise from an exponential family likelihood. That is, given (vector) parameter $$\theta$$, $$X$$ has a density of the following form: $p(x \mid \theta) = h(x) \exp\left( \eta(\theta)^T T(x) - A(\theta) \right)$ Here:

• $$\eta(\theta)$$ is the natural (canonical) parameter, which is some transform of the naive parameter $$\theta$$. The natural parameterisation is the one in which the log-density is linear in the sufficient statistics. In fact, it is so much simpler to use the natural parameters that we drop $$\theta$$ and just work with $$\eta$$ hereafter.
• $$T(x)$$ is the sufficient statistic (may be vector-valued).
• $$A(\theta)$$ is the log-partition function.
• $$h(x)$$ is the base measure.

Rewriting in natural parameters, we have $p(x \mid \eta) = h(x) \exp\left( \eta^T T(x) - A(\eta) \right).$
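A concrete case may help here. The Bernoulli distribution is my choice of illustration, not something the derivation depends on. With $$\theta = P(X=1)$$ we can take $$h(x) = 1$$, $$T(x) = x$$, $$\eta = \log\frac{\theta}{1-\theta}$$ and $$A(\eta) = \log(1 + e^{\eta})$$. The sketch below checks numerically that the two parameterisations agree; the function names are hypothetical.

```python
import math

def bernoulli_pmf(x, theta):
    """Usual parameterisation: theta = P(X = 1)."""
    return theta if x == 1 else 1.0 - theta

def bernoulli_natural_pmf(x, eta):
    """Exponential-family form h(x) exp(eta * T(x) - A(eta)) with
    h(x) = 1, T(x) = x and A(eta) = log(1 + exp(eta))."""
    log_partition = math.log1p(math.exp(eta))
    return math.exp(eta * x - log_partition)

theta = 0.3
eta = math.log(theta / (1.0 - theta))  # natural parameter: the log-odds

# Both forms should assign the same probability to each outcome.
for x in (0, 1):
    assert math.isclose(bernoulli_pmf(x, theta), bernoulli_natural_pmf(x, eta))
```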

If we knew $$\eta$$ we would now have a distribution for $$X$$. In practice, we are not sure about $$\eta$$, so we have a prior distribution for $$\eta$$. Things will go well for us if we choose this prior to have a particular, and particularly convenient, form.

## 2 Conjugate prior

The conjugate prior for $$\eta$$ is designed to ensure that the posterior distribution remains within the same family after a realisation from that likelihood we just introduced.

A conjugate prior has to look like this: $p(\eta \mid \lambda, \nu) = f(\lambda, \nu) \exp\left( \eta^T \lambda - \nu A(\eta) \right)$ where $$\lambda$$ means something like ‘accumulated sufficient statistics from prior knowledge’ and $$\nu$$ is the ‘weight’ of the prior or the ‘number of prior observations’. These are effectively hyperparameters encoding how certain we are. This looks like an exponential family distribution (here $$f(\lambda, \nu)$$ is the normalising factor, since it depends on the hyperparameters and not on $$\eta$$), except for this weird scaling of the log-partition function by $$\nu$$. It is in fact a tempered exponential family.
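To make the normalising factor less abstract, here is a sketch for the Bernoulli likelihood (an illustrative assumption of mine, with $$A(\eta) = \log(1 + e^{\eta})$$). Changing variables to $$\theta = 1/(1 + e^{-\eta})$$ turns the prior into a Beta density with shape parameters $$\lambda$$ and $$\nu - \lambda$$, so we expect $$1/f(\lambda, \nu) = B(\lambda, \nu - \lambda)$$ whenever $$0 < \lambda < \nu$$. The code compares a crude trapezoid-rule integral over $$\eta$$ with that closed form; all names are hypothetical.

```python
import math

def log_partition(eta):
    """Bernoulli log-partition A(eta) = log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))

def unnormalised_prior(eta, lam, nu):
    """exp(eta * lam - nu * A(eta)): the conjugate prior without f(lam, nu)."""
    return math.exp(eta * lam - nu * log_partition(eta))

def inverse_f_numeric(lam, nu, lo=-40.0, hi=40.0, n=80_000):
    """Trapezoid-rule estimate of 1 / f(lam, nu), the integral over eta."""
    step = (hi - lo) / n
    total = 0.5 * (unnormalised_prior(lo, lam, nu) + unnormalised_prior(hi, lam, nu))
    for i in range(1, n):
        total += unnormalised_prior(lo + i * step, lam, nu)
    return total * step

def inverse_f_beta(lam, nu):
    """Closed form under the Bernoulli assumption: the Beta function B(lam, nu - lam)."""
    return math.exp(math.lgamma(lam) + math.lgamma(nu - lam) - math.lgamma(nu))

lam, nu = 2.0, 5.0
numeric = inverse_f_numeric(lam, nu)
closed = inverse_f_beta(lam, nu)
assert math.isclose(numeric, closed, rel_tol=1e-6)
```

The integration limits and grid size are arbitrary choices that happen to be ample for these hyperparameters; heavier-tailed settings would need more care.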

## 3 Prior predictive distribution

The prior predictive distribution for a new observation $$x$$ is obtained by integrating the product of the likelihood and the prior over the natural parameter $$\eta$$: \begin{aligned} p(x) &= \int p(x \mid \eta) p(\eta \mid \lambda, \nu) \, d\eta\\ & = \int h(x) \exp(\eta^T T(x) - A(\eta)) f(\lambda, \nu) \exp(\eta^T \lambda - \nu A(\eta)) \, d\eta \\ &= h(x) f(\lambda, \nu) \int \exp\left(\eta^T (T(x) + \lambda) - (\nu + 1) A(\eta)\right) \, d\eta. \end{aligned}

This integral represents the normalization constant of an updated exponential family distribution with parameters updated to $$\lambda' = \lambda + T(x)$$ and $$\nu' = \nu + 1$$. Thus, the integral simplifies to $$1/f(\lambda', \nu')$$, where $$f$$ is the normalizing factor ensuring that the distribution integrates to 1. Hence, the prior predictive distribution becomes: $p(x) = \frac{h(x) f(\lambda, \nu)}{f(\lambda + T(x), \nu + 1)}$

The prior predictive distribution $$p(x)$$ essentially provides the likelihood of observing $$x$$ before any actual data are observed, based solely on the prior parameters $$\lambda$$ and $$\nu$$. This distribution reflects how beliefs encoded in the prior (through $$\lambda$$ and $$\nu$$) influence expectations about future data points, integrated over all possible values of the natural parameter $$\eta$$.
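As a sanity check on the ratio formula, consider again a Bernoulli likelihood (my assumption for illustration), for which $$h(x) = 1$$, $$T(x) = x$$ and $$1/f(\lambda, \nu) = B(\lambda, \nu - \lambda)$$. The formula then predicts $$p(x = 1) = \lambda/\nu$$, i.e. the prior ‘successes’ divided by the prior ‘observations’. A minimal sketch with hypothetical names:

```python
import math

def log_f(lam, nu):
    """log f(lam, nu) for the Bernoulli case, where 1 / f = B(lam, nu - lam)."""
    return math.lgamma(nu) - math.lgamma(lam) - math.lgamma(nu - lam)

def prior_predictive(x, lam, nu):
    """p(x) = h(x) f(lam, nu) / f(lam + T(x), nu + 1), with h(x) = 1 and T(x) = x."""
    return math.exp(log_f(lam, nu) - log_f(lam + x, nu + 1))

lam, nu = 2.0, 5.0
p1 = prior_predictive(1, lam, nu)
p0 = prior_predictive(0, lam, nu)

assert math.isclose(p1, lam / nu)         # prior mean of theta
assert math.isclose(p0, (nu - lam) / nu)
```

Working with $$\log f$$ rather than $$f$$ is a design choice to avoid overflow in the gamma functions when $$\nu$$ is large.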

## 4 Updating the prior

Let us suppose an observation $$x\sim X$$ arrives. We would like a conjugate posterior update that incorporates the new information. The update to the conjugate prior’s parameters is:

1. $$\lambda_{\text{posterior}} \gets \lambda + T(x)$$, incorporating the new data’s sufficient statistic into the prior accumulated statistics, and
2. $$\nu_{\text{posterior}} \gets \nu + 1$$, an increment in the effective number of observations.

The posterior distribution of $$\eta$$, after observing $$x$$, in full, is thus: $p(\eta \mid \lambda + T(x), \nu + 1) = f(\lambda + T(x), \nu + 1) \exp\left( \eta^T (\lambda + T(x)) - (\nu + 1) A(\eta) \right)$
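The update rule is simple enough to sketch generically. The code below assumes scalar hyperparameters and a user-supplied sufficient statistic (defaulting to $$T(x) = x$$, as for a Bernoulli likelihood); the function names are hypothetical. Applying it repeatedly handles a batch of i.i.d. observations.

```python
def update(lam, nu, x, sufficient_statistic=lambda x: x):
    """One conjugate update: lam <- lam + T(x), nu <- nu + 1."""
    return lam + sufficient_statistic(x), nu + 1

def update_many(lam, nu, xs, sufficient_statistic=lambda x: x):
    """Fold a sequence of i.i.d. observations into the hyperparameters."""
    for x in xs:
        lam, nu = update(lam, nu, x, sufficient_statistic)
    return lam, nu

prior = (2.0, 5.0)   # (lambda, nu)
data = [1, 0, 1, 1]
posterior = update_many(*prior, data)

# Only the sum of T(x) and the count enter the update,
# so the order in which the data arrive does not matter.
assert posterior == update_many(*prior, reversed(data))
```

Here the posterior hyperparameters are $$\lambda = 5$$ and $$\nu = 9$$: three successes and four observations added to the prior.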

## 5 Posterior predictive

As with the prior predictive, we need to integrate out the natural parameters of the likelihood. The posterior predictive distribution for a new observation $$x'$$ given the observed data $$x$$ is obtained by integrating over the posterior distribution of $$\eta$$: $p(x' \mid x) = \int p(x' \mid \eta) p(\eta \mid x) \, d\eta$ Expanding this using the forms we derived above for $$p(x' \mid \eta)$$ and $$p(\eta \mid x)$$, we find: \begin{aligned} p(x' \mid x) &= \int h(x') \exp\left(\eta^T T(x') - A(\eta)\right) f(\lambda + T(x), \nu + 1) \exp\left(\eta^T (\lambda + T(x)) - (\nu + 1) A(\eta)\right) \, d\eta \\ &= h(x') f(\lambda + T(x), \nu + 1) \int \exp\left(\eta^T (T(x') + \lambda + T(x)) - (\nu + 2) A(\eta)\right) \, d\eta \end{aligned} This integral represents the normalizing constant of an updated exponential family distribution with parameters $$\lambda' = \lambda + T(x) + T(x')$$ and $$\nu' = \nu + 2$$. Thus, the integral simplifies to $$1/f(\lambda', \nu')$$, where $$f$$ is the normalizing constant. Hence, $p(x' \mid x) = \frac{h(x') f(\lambda + T(x), \nu + 1)}{f(\lambda + T(x) + T(x'), \nu + 2)}$
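Put differently, the posterior predictive is the prior predictive evaluated at the updated hyperparameters. The sketch below checks this for a Bernoulli likelihood (an assumption for illustration, with $$1/f(\lambda, \nu) = B(\lambda, \nu - \lambda)$$), and also confirms that $$p(x' \mid x)\, p(x)$$ matches the joint marginal of the two observations. Names are hypothetical.

```python
import math

def log_f(lam, nu):
    """log f(lam, nu) for the Bernoulli case, where 1 / f = B(lam, nu - lam)."""
    return math.lgamma(nu) - math.lgamma(lam) - math.lgamma(nu - lam)

def predictive(x_new, lam, nu):
    """f(lam, nu) / f(lam + T(x_new), nu + 1), with h = 1 and T(x) = x."""
    return math.exp(log_f(lam, nu) - log_f(lam + x_new, nu + 1))

def posterior_predictive(x_new, x_obs, lam, nu):
    """Predictive after one observation: same ratio at updated hyperparameters."""
    return predictive(x_new, lam + x_obs, nu + 1)

lam, nu = 2.0, 5.0
x_obs, x_new = 1, 1

p_first = predictive(x_obs, lam, nu)                    # p(x)
p_next = posterior_predictive(x_new, x_obs, lam, nu)    # p(x' | x)

# Joint marginal of both observations: f(lam, nu) / f(lam + x + x', nu + 2).
joint = math.exp(log_f(lam, nu) - log_f(lam + x_obs + x_new, nu + 2))

assert math.isclose(p_next, (lam + x_obs) / (nu + 1))
assert math.isclose(joint, p_first * p_next)
```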

## 6 Mixtures

TBD. See Dalal and Hall (1983).

## 9 References

Broderick, Wilson, and Jordan. 2018. Bernoulli.
Consonni, and Veronese. 1992. Journal of the American Statistical Association.
Dalal, and Hall. 1983. Journal of the Royal Statistical Society: Series B (Methodological).
DeGroot. 2005. Optimal Statistical Decisions.
Diaconis, and Ylvisaker. 1979. The Annals of Statistics.
Fink. 1997.
Gurevich, and Stuke. 2019.
Khan, and Lin. 2017. In Artificial Intelligence and Statistics.
Khan, and Rue. 2023.
Morris. 1983. The Annals of Statistics.
Morris, and Lock. 2009. The American Statistician.
Murphy. 2007.
O’Hagan. 2010. Kendall’s Advanced Theory of Statistics: Bayesian Inference. Volume 2B.
Orbanz. 2009.
———. 2011.
Pratt, Raiffa, and Schlaifer. 1995. Introduction to Statistical Decision Theory.
Raiffa. 2002. “Decision Analysis: A Personal Account of How It Got Started and Evolved.” Oper. Res.
Raiffa, and Schlaifer. 2000. Applied Statistical Decision Theory. Wiley Classics Library.
Robert. 2007. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer texts in statistics.
Wainwright, and Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends® in Machine Learning.