A conjugate prior is one that is closed under sampling given its matched likelihood function. I occasionally see people talk about this as if it usefully applies to non-exponential family likelihoods, but I am familiar with it only in the case of exponential families, so we restrict ourselves to that case here.
The idea seems to have arisen in the 1960s (DeGroot 2005; Raiffa and Schlaifer 2000) and to have been re-interpreted in the 1970s (Diaconis and Ylvisaker 1979). A pragmatic intro is Fink (1997); Robert (2007) chapter 3 is gentler.
Exponential families have tractable conjugate priors, which means that the posterior distribution is in the same family as the prior and, moreover, there is a simple formula for updating its parameters. This is deliciously easy, and also misleads one into thinking that Bayesian inference in general is much easier than it actually is, because it is so easy in this special case.
We are going to observe lots of i.i.d. realisations $x_1, x_2, \dots$ of some variate $x$.
1 Exponential family likelihood
Our observation $x$ has a density of exponential-family form,
$$
p(x\mid\theta) = h(x)\exp\bigl(\eta(\theta)^{\top} T(x) - A(\theta)\bigr).
$$
Here $\eta(\theta)$ is the natural (canonical) parameter, which is some transform of the naive parameter $\theta$. The natural parameter is the parameterisation of the distribution in which the log-density is linear in the sufficient statistics. In fact, it is so much simpler to use the natural parameters that we drop $\theta$ and just work with $\eta$ hereafter. $T(x)$ is the sufficient statistic (which may be vector-valued). $A$ is the log-partition function. $h$ is the base measure.
Rewriting in natural parameters, we have
$$
p(x\mid\eta) = h(x)\exp\bigl(\eta^{\top} T(x) - A(\eta)\bigr).
$$
If we knew $\eta$ we would know the distribution of $x$ exactly; since we do not, we treat it as a random variable and place a prior on it.
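To make this concrete, here is a minimal sketch in Python of the Poisson distribution written in natural-parameter form, with $T(x) = x$, $A(\eta) = e^{\eta}$, $h(x) = 1/x!$ and $\eta = \log\lambda$. The function names are mine, not standard:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import poisson

# Poisson in natural-parameter form:
#   p(x | eta) = h(x) exp(eta * T(x) - A(eta))
# with eta = log(lambda), T(x) = x, A(eta) = exp(eta), h(x) = 1/x!.

def log_h(x):
    return -gammaln(x + 1.0)   # log h(x) = -log x!

def T(x):
    return x                   # sufficient statistic

def A(eta):
    return np.exp(eta)         # log-partition function

def log_lik(x, eta):
    return log_h(x) + eta * T(x) - A(eta)

# Sanity check against the usual rate parameterisation:
lam = 3.5
x = np.arange(10)
assert np.allclose(log_lik(x, np.log(lam)), poisson.logpmf(x, lam))
```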
2 Conjugate prior
The conjugate prior for $\eta$ has hyperparameters $\chi$, which accumulates sufficient statistics, and $\nu$, an effective number of prior observations. A conjugate prior has to look like this:
$$
\pi(\eta\mid\chi,\nu) = f(\chi,\nu)\exp\bigl(\eta^{\top}\chi - \nu A(\eta)\bigr),
$$
where $f(\chi,\nu)$ is the normalising constant.
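For example, with the Poisson likelihood we have $A(\eta) = e^{\eta}$, so $\pi(\eta\mid\chi,\nu) \propto \exp(\eta\chi - \nu e^{\eta})$; changing variables to $\lambda = e^{\eta}$ gives a density proportional to $\lambda^{\chi - 1} e^{-\nu\lambda}$, i.e. the familiar $\operatorname{Gamma}(\chi, \nu)$ prior on the rate, with normalising constant $f(\chi,\nu) = \nu^{\chi}/\Gamma(\chi)$.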
3 Prior predictive distribution
The prior predictive distribution for a new observation $\tilde{x}$ is
$$
\begin{aligned}
p(\tilde{x}\mid\chi,\nu)
&= \int p(\tilde{x}\mid\eta)\,\pi(\eta\mid\chi,\nu)\,\mathrm{d}\eta\\
&= h(\tilde{x})\,f(\chi,\nu)\int \exp\bigl(\eta^{\top}(\chi + T(\tilde{x})) - (\nu+1)A(\eta)\bigr)\,\mathrm{d}\eta.
\end{aligned}
$$
This integral is the reciprocal of the normalising constant of an updated exponential family distribution with parameters updated to $\chi + T(\tilde{x})$ and $\nu + 1$, i.e. it equals $1/f(\chi + T(\tilde{x}),\,\nu + 1)$. The prior predictive distribution is therefore
$$
p(\tilde{x}\mid\chi,\nu) = h(\tilde{x})\,\frac{f(\chi,\nu)}{f(\chi + T(\tilde{x}),\,\nu + 1)}.
$$
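Continuing the Poisson example, here is a sketch of this formula in code, using $\log f(\chi,\nu) = \chi\log\nu - \log\Gamma(\chi)$ from above. The resulting predictive should match the negative binomial, a standard identity, which the assertion checks; again the helper names are mine:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import nbinom

def log_f(chi, nu):
    # log normaliser of the Poisson conjugate prior:
    # f(chi, nu) = nu**chi / Gamma(chi), a Gamma(chi, nu) on lambda.
    return chi * np.log(nu) - gammaln(chi)

def log_prior_predictive(x, chi, nu):
    # log [ h(x) f(chi, nu) / f(chi + T(x), nu + 1) ], with T(x) = x
    return -gammaln(x + 1.0) + log_f(chi, nu) - log_f(chi + x, nu + 1.0)

# The Poisson-Gamma predictive is negative binomial with
# r = chi and p = nu / (nu + 1):
chi, nu = 2.5, 1.5
x = np.arange(8)
assert np.allclose(
    log_prior_predictive(x, chi, nu),
    nbinom.logpmf(x, chi, nu / (nu + 1.0)),
)
```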
4 Updating the prior
Let us suppose an observation $x$ arrives. The updated hyperparameters are
$$
\chi' = \chi + T(x),
$$
incorporating the new data's sufficient statistic into the prior accumulated statistics, and
$$
\nu' = \nu + 1,
$$
an increment in the effective number of observations. The posterior distribution of $\eta$ is then
$$
\pi(\eta\mid x,\chi,\nu) \propto p(x\mid\eta)\,\pi(\eta\mid\chi,\nu)
\propto \exp\bigl(\eta^{\top}(\chi + T(x)) - (\nu+1)A(\eta)\bigr)
= \pi(\eta\mid\chi',\nu'),
$$
which is in the same family as the prior, as promised. For $n$ i.i.d. observations $x_1,\dots,x_n$ we simply accumulate: $\chi' = \chi + \sum_{i} T(x_i)$ and $\nu' = \nu + n$.
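In code the update is one line of bookkeeping. A sketch, again for the Poisson where $T(x) = x$:

```python
import numpy as np

def update(chi, nu, xs):
    # Conjugate update: accumulate the sufficient statistics
    # (T(x) = x for the Poisson) and count the observations.
    return chi + float(np.sum(xs)), nu + len(xs)

# Starting from chi = 2.5, nu = 1.5, i.e. a Gamma(2.5, 1.5) prior on lambda:
chi_n, nu_n = update(2.5, 1.5, [2, 0, 3, 1])
# chi_n = 8.5, nu_n = 5.5: the posterior on lambda is Gamma(8.5, 5.5).
```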
5 Posterior predictive
As with the prior predictive, we need to integrate out the natural parameters of the likelihood, now against the posterior. The posterior predictive distribution for a new observation $\tilde{x}$, given observations with accumulated hyperparameters $\chi'$ and $\nu'$, is
$$
p(\tilde{x}\mid\chi',\nu') = h(\tilde{x})\,\frac{f(\chi',\nu')}{f(\chi' + T(\tilde{x}),\,\nu' + 1)},
$$
i.e. the prior predictive formula with the updated hyperparameters plugged in.
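So in the Poisson example the posterior predictive is again negative binomial, now with the updated hyperparameters. A brief usage sketch, assuming the hypothetical helpers `update` and `log_prior_predictive` from the earlier blocks are in scope:

```python
import numpy as np

chi_n, nu_n = update(2.5, 1.5, [2, 0, 3, 1])      # Gamma(8.5, 5.5) posterior
x = np.arange(8)
post_pred = np.exp(log_prior_predictive(x, chi_n, nu_n))
# post_pred is the nbinom(8.5, 5.5/6.5) pmf evaluated at x = 0, ..., 7.
```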
6 Updates
Collecting all the pieces in one place: the conjugate prior for $\eta$ is
$$
\pi(\eta\mid\chi,\nu) = f(\chi,\nu)\exp\bigl(\eta^{\top}\chi - \nu A(\eta)\bigr).
$$
The prior predictive distribution for a new observation $\tilde{x}$ is
$$
p(\tilde{x}\mid\chi,\nu) = h(\tilde{x})\,\frac{f(\chi,\nu)}{f(\chi + T(\tilde{x}),\,\nu + 1)}.
$$
Let us suppose an observation $x$ arrives; the hyperparameters update as
$$
\chi' = \chi + T(x),\qquad \nu' = \nu + 1.
$$
The posterior distribution of $\eta$ is
$$
\pi(\eta\mid x,\chi,\nu) = \pi(\eta\mid\chi',\nu').
$$
The posterior predictive distribution for a new observation $\tilde{x}$ is
$$
p(\tilde{x}\mid\chi',\nu') = h(\tilde{x})\,\frac{f(\chi',\nu')}{f(\chi' + T(\tilde{x}),\,\nu' + 1)}.
$$
The last line follows from the observation that the integral term in the predictive represents the reciprocal of the normalising constant of an updated exponential family distribution with parameters $\chi' + T(\tilde{x})$ and $\nu' + 1$.
7 Mixtures
The under-rated bit of the conjugate prior thing is that, while the priors themselves are not that flexible, some very interesting priors can be constructed as mixtures of conjugate priors.
TBC. See Dalal and Hall (1983), O’Hagan (2010),…
Farrow’s tutorial introduction:
> Consider what happens when we update our beliefs using Bayes' theorem. Suppose we have a prior density $f^{(0)}(\theta)$ for a parameter $\theta$ and suppose the likelihood is $L(\theta)$. Then our posterior density is
> $$f^{(1)}(\theta) = C f^{(0)}(\theta)\,L(\theta)$$
> where
> $$C^{-1} = \int f^{(0)}(\theta)\,L(\theta)\,\mathrm{d}\theta.$$
> Now let our prior density for a parameter $\theta$ be the mixture
> $$f^{(0)}(\theta) = \sum_{j} a_j f_j^{(0)}(\theta), \qquad a_j > 0,\ \sum_j a_j = 1.$$
> Our posterior density is
> $$f^{(1)}(\theta) = C \sum_j a_j f_j^{(0)}(\theta)\,L(\theta).$$
> Hence we require
> $$C^{-1} = \sum_j a_j \int f_j^{(0)}(\theta)\,L(\theta)\,\mathrm{d}\theta = \sum_j a_j C_j^{-1},$$
> so, writing $C_j^{-1} = \int f_j^{(0)}(\theta)\,L(\theta)\,\mathrm{d}\theta$ for the evidence of component $j$, the posterior density is
> $$f^{(1)}(\theta) = \sum_j a_j^{*} f_j^{(1)}(\theta),$$
> where
> $$f_j^{(1)}(\theta) = C_j f_j^{(0)}(\theta)\,L(\theta), \qquad a_j^{*} = \frac{a_j C_j^{-1}}{\sum_k a_k C_k^{-1}}.$$
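That is, each conjugate component updates as usual, and the mixture weights are re-weighted by each component's evidence. Here is a sketch of this update for the running Poisson example, where each component is a Gamma prior in $(\chi_j, \nu_j)$ form and each $C_j^{-1}$ comes from the ratio of normalisers (the shared $\prod_i h(x_i)$ factor cancels when the weights are normalised); the function names are mine:

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_f(chi, nu):
    # log normaliser of the Poisson conjugate family, Gamma(chi, nu)
    return chi * np.log(nu) - gammaln(chi)

def update_mixture(weights, hypers, xs):
    """One conjugate update of a mixture prior.

    weights: mixture weights a_j; hypers: list of (chi, nu) pairs.
    Returns the posterior weights a_j* and updated (chi, nu) pairs.
    """
    n, s = len(xs), float(np.sum(xs))            # nu and chi increments
    new_hypers = [(chi + s, nu + n) for chi, nu in hypers]
    # log C_j^{-1}, up to the shared factor prod_i h(x_i):
    log_evid = np.array([
        log_f(chi, nu) - log_f(chi_, nu_)
        for (chi, nu), (chi_, nu_) in zip(hypers, new_hypers)
    ])
    log_w = np.log(weights) + log_evid
    return np.exp(log_w - logsumexp(log_w)), new_hypers

# e.g. a bimodal prior on the Poisson rate: mass near 0.5 and near 8
weights, hypers = np.array([0.5, 0.5]), [(1.0, 2.0), (16.0, 2.0)]
new_w, new_h = update_mixture(weights, hypers, [7, 9, 8])
# new_w should now strongly favour the second (high-rate) component.
```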
8 In nonparametrics
See Broderick, Wilson, and Jordan (2018) and Orbanz (2011).

Incoming