“Generalized” Bayesian inference
Approximating the Gibbs posterior
2024-09-26 — 2025-05-30
Wherein the Limitations of Kullback-Leibler Divergence Under Model Misspecification Are Examined, and Alternative Divergence Measures Are Proposed as Replacements in Bayesian Updating.
1 ‘Generalized’
I dislike naming things “Generalized”, for all the obvious reasons. If biologists named Eukaryotes Generalized Prokaryotes they would be mocked. You cannot do this in the rest of science, but in machine learning, somehow it is normal, and so you get naming abominations like Generalized Generalized Models.
This naming contention will continue, and I will continue to hate it. So it goes.
“Generalized Bayesian inference” implies a certain flavour of approximation to a certain kind of relaxation of traditional Bayesian inference. Compare and contrast to “Approximate Bayesian inference”, which uses different approximations and relaxations, and is equally damnably named. Which is which, and why, is best demonstrated by example.
We follow the reasoning in Jewson, Smith, and Holmes (2018).
In the M-closed scenario — where the true data-generating process lies within the model class — Bayesian updating and maximum likelihood estimation are justified, because they minimise the KL divergence between the true distribution and the model.
But when the models are mis-specified, KL divergence — the mathematical backbone of these methods — isn’t such a slam dunk. A high-speed tour follows.
The Kullback-Leibler divergence between distributions P and Q is defined as:
\[D_{KL}(P \parallel Q) = \int p(x) \log \frac{p(x)}{q(x)} dx\]
In the well-specified case (“M-closed”), this divergence is what maximum likelihood estimation minimises in the large-sample limit. When we maximise the likelihood:
\[\hat{\theta}_{MLE} = \arg\max_\theta \sum_{i=1}^n \log p(x_i | \theta)\]
We’re implicitly minimising a sample estimate of \(D_{KL}(P_{true} \parallel P_\theta)\), up to a term that does not depend on \(\theta\), where \(P_{true}\) is the true data-generating distribution. This is why MLE is “optimal”: as the sample grows, it finds the model closest to reality in KL terms.
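To spell out that step, here is a sketch assuming the \(x_i\) are i.i.d. draws from \(P_{true}\), that \(p_{true}\) (a symbol introduced here) is its density, and that the expectations below exist:

\[D_{KL}(P_{true} \parallel P_\theta) = \mathbb{E}_{P_{true}}[\log p_{true}(x)] - \mathbb{E}_{P_{true}}[\log p(x | \theta)]\]

The first term does not depend on \(\theta\). By the law of large numbers, the second is what the average log-likelihood estimates:

\[\frac{1}{n}\sum_{i=1}^n \log p(x_i | \theta) \to \mathbb{E}_{P_{true}}[\log p(x | \theta)]\]

So maximising the average log-likelihood over \(\theta\) is, for large \(n\), the same as minimising \(D_{KL}(P_{true} \parallel P_\theta)\) over \(\theta\).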
But here’s the problem: what if \(P_{true}\) isn’t in our model class at all? This is the M-open world. And, in fact, no model ever includes the truth; if we ignore that fact and pretend, weird stuff can happen (Shalizi 2009).
White (1982) showed that under misspecification, MLE still converges to something meaningful—the parameter that minimises KL divergence between the true distribution and our model class:
\[\theta^* = \arg\min_\theta D_{KL}(P_{true} \parallel P_\theta)\]
Is that reassuring? At best mildly: a small KL discrepancy can still lead to terrible decisions.
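A toy illustration of the pseudo-true parameter, under assumptions of my own choosing: the truth is Exponential(1) and the model class is Gaussian. For a Gaussian family the KL minimiser is moment matching, so \(\theta^*\) is mean 1 and variance 1, and the Gaussian MLE should land near it. All variable names here are made up for the sketch.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# True process: Exponential(1), which is NOT in the Gaussian model class.
x = rng.exponential(scale=1.0, size=200_000)

# Gaussian MLE in closed form: sample mean and (biased) sample variance.
mu_hat = float(x.mean())
var_hat = float(x.var())

# KL minimiser within the Gaussian family is moment matching:
# mu* = E[x] = 1 and var* = Var[x] = 1 for Exponential(1).
mu_star, var_star = 1.0, 1.0

# Mass the KL-optimal Gaussian puts on x < 0, where the truth has none.
negative_mass = 0.5 * (1.0 + math.erf((0.0 - mu_star) / math.sqrt(2.0 * var_star)))

print(f"MLE: mu={mu_hat:.3f}, var={var_hat:.3f}")
print(f"P(x < 0) under N(mu*, var*): {negative_mass:.3f}")
```

The KL-optimal Gaussian assigns roughly 16% of its mass to negative values that the true process never produces. That is the sort of thing that can wreck a decision even though the fit is the best available in KL terms.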
Bissiri, Holmes, and Walker (2016) asked: why restrict ourselves to KL divergence at all? They developed a general Bayesian framework in which a loss or divergence measure \(d(\cdot, \cdot)\) takes the place of the log-likelihood:
\[\pi(\theta | data) \propto \pi(\theta) \exp(-\eta \cdot d(data, model(\theta)))\]
where \(\eta\) controls the learning rate. When \(d\) is the negative log-likelihood and \(\eta = 1\), we recover standard Bayesian updating. But we can choose other losses or divergences based on what aspects of the data matter most for our specific problem.
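A minimal sketch of this update on a one-dimensional grid, assuming a location parameter, a wide Gaussian prior, and contaminated data. The helper `generalised_posterior_on_grid` and both loss functions are hypothetical names for illustration, not anything from the papers cited.

```python
import numpy as np

def generalised_posterior_on_grid(theta_grid, log_prior, loss, data, eta=1.0):
    """Normalised density proportional to prior * exp(-eta * total loss) on a grid."""
    total_loss = np.array([loss(t, data).sum() for t in theta_grid])
    log_post = log_prior - eta * total_loss
    log_post -= log_post.max()  # stabilise before exponentiating
    weights = np.exp(log_post)
    dtheta = theta_grid[1] - theta_grid[0]
    return weights / (weights.sum() * dtheta)

def squared_loss(theta, x):
    # Negative log-likelihood of N(theta, 1), up to an additive constant.
    return 0.5 * (x - theta) ** 2

def absolute_loss(theta, x):
    # Targets the median rather than the mean.
    return np.abs(x - theta)

rng = np.random.default_rng(1)
# 95 well-behaved points around 0 plus 5 gross outliers at 50.
data = np.concatenate([rng.normal(0.0, 1.0, size=95), np.full(5, 50.0)])

theta_grid = np.linspace(-5.0, 10.0, 3001)
log_prior = -0.5 * (theta_grid / 10.0) ** 2  # N(0, 10^2) prior, up to a constant

post_sq = generalised_posterior_on_grid(theta_grid, log_prior, squared_loss, data)
post_abs = generalised_posterior_on_grid(theta_grid, log_prior, absolute_loss, data)

mode_sq = theta_grid[post_sq.argmax()]
mode_abs = theta_grid[post_abs.argmax()]
print(f"squared-loss mode: {mode_sq:.2f}, absolute-loss mode: {mode_abs:.2f}")
```

With the squared loss and \(\eta = 1\) this is ordinary Bayes for a unit-variance Gaussian, and the outliers drag the posterior towards the contaminated mean. The absolute loss keeps it near the bulk of the data.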
TODO: this looks a lot like Zellner’s argument (Zellner 1988) used in Bayes-by-backprop.
Multiple research groups (Ghosh and Basu 2018; Hooker and Vidyashankar 2014) developed “minimum divergence estimation” approaches, minimising other divergences:
- Squared Hellinger distance: \(H^2(P,Q) = \frac{1}{2}\int (\sqrt{p(x)} - \sqrt{q(x)})^2 dx\)
- Total variation: \(TV(P,Q) = \frac{1}{2}\int |p(x) - q(x)| dx\)
- α-divergences: \(D_\alpha(P \parallel Q) = \frac{1}{\alpha(\alpha-1)} \int p(x) \left[ \left(\frac{p(x)}{q(x)}\right)^{\alpha-1} - 1 \right] dx\)
Each targets different features of the data and offers different robustness properties under misspecification.
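To make the definitions above concrete, here is a sketch that evaluates them by a Riemann sum for two unit-variance Gaussians, \(P = N(0,1)\) and \(Q = N(1,1)\). The pair of distributions and the grid are my choices for illustration. For this pair the closed forms are known, so the numbers can be checked.

```python
import math
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-12.0, 13.0, 50_001)
dx = x[1] - x[0]
p = normal_pdf(x, 0.0, 1.0)
q = normal_pdf(x, 1.0, 1.0)

def integrate(f):
    """Riemann sum on the fixed grid; the tails are negligible here."""
    return float(np.sum(f) * dx)

kl = integrate(p * np.log(p / q))
hellinger_sq = 0.5 * integrate((np.sqrt(p) - np.sqrt(q)) ** 2)
tv = 0.5 * integrate(np.abs(p - q))

def alpha_div(alpha):
    """Alpha-divergence in the form given in the list above (alpha not 0 or 1)."""
    return integrate(p * ((p / q) ** (alpha - 1.0) - 1.0)) / (alpha * (alpha - 1.0))

# Closed forms for two unit-variance Gaussians whose means differ by 1.
kl_exact = 0.5
hellinger_sq_exact = 1.0 - math.exp(-1.0 / 8.0)
tv_exact = math.erf(0.5 / math.sqrt(2.0))

print(f"KL           {kl:.4f} (exact {kl_exact:.4f})")
print(f"Hellinger^2  {hellinger_sq:.4f} (exact {hellinger_sq_exact:.4f})")
print(f"TV           {tv:.4f} (exact {tv_exact:.4f})")
print(f"alpha=0.5    {alpha_div(0.5):.4f}")
print(f"alpha=0.999  {alpha_div(0.999):.4f}")
```

Two sanity checks fall out of this parameterisation: at \(\alpha = 1/2\) the α-divergence is four times the squared Hellinger quantity, and as \(\alpha \to 1\) it approaches the KL divergence.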
Nothing dominates (although actually optimal transport is a strong contender), and the main insight is that the choice of divergence should match your inferential or decision-theoretical goals.
So, how do we practically do inference that minimizes these divergences?
2 Gibbs Posterior
The simplest relaxation of classic Bayes is to replace the negative log-likelihood with a general loss function; the resulting update is called a Gibbs posterior.
3 Generalized Bayesian Computation
I just saw a presentation on Dellaporta et al. (2022), which stakes a claim to the term “Generalized Bayesian Computation”. It mixes bootstrap, Bayesian nonparametrics, MMD, and simulation-based inference in an M-open setting. I’m not sure which of the results are specific to that (impressive) paper, but Dellaporta name-checks Fong, Lyddon, and Holmes (2019); Lyddon, Walker, and Holmes (2018); Matsubara et al. (2022); Pacchiardi and Dutta (2022); and Schmon, Cannon, and Knoblauch (2021).
There’s some interesting stuff happening in that group. Maybe this introductory post will be a good start: Generalising Bayesian Inference.
4 Generalized Variational Bayes Inference
Is this different from the above? I’m not even sure. The explicit variational approximation is hard to see in Dellaporta et al. (2022), whereas it is more obvious in Knoblauch, Jewson, and Damoulas (2022), so let’s claim they are different for now.
Knoblauch, Jewson, and Damoulas (2022) call a variational approximation to the Gibbs posterior using a non-KL divergence a Generalized Variational Inference (GVI) method.[1]
The argument is that we can interpret the solution to the Robust Bayesian Inference problem variationally. We recall the average risk in mean form:
\[ R_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(\theta, x_i) \]
which defines the Gibbs posterior measure as
\[ \pi_n(\theta) \propto \exp\{-\omega\, n\, R_n(\theta)\}\,\pi(\theta). \]
They argue that this is equivalent to solving an optimisation problem over probability measures \(q(\theta)\) of the form
\[ q^* = \arg\min_{q \in \mathcal{P}(\Theta)} \left\{\omega\, n\, \mathbb{E}_q\bigl[R_n(\theta)\bigr] + \mathrm{KL}(q\| \pi)\right\}. \]
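A short sketch of why the two coincide, assuming the normalising constant \(Z_n = \int \exp\{-\omega\, n\, R_n(\theta)\}\,\pi(\theta)\,d\theta\) (a symbol introduced here) is finite, and writing \(\pi_n\) for the Gibbs posterior defined above:

\[ \omega\, n\, \mathbb{E}_q\bigl[R_n(\theta)\bigr] + \mathrm{KL}(q\| \pi) = \int q(\theta) \log \frac{q(\theta)}{\pi(\theta)\exp\{-\omega\, n\, R_n(\theta)\}}\, d\theta = \mathrm{KL}(q\| \pi_n) - \log Z_n. \]

Since \(\log Z_n\) does not depend on \(q\), and \(\mathrm{KL}(q\| \pi_n) \ge 0\) with equality exactly when \(q = \pi_n\), the minimiser over all of \(\mathcal{P}(\Theta)\) is \(q^* = \pi_n\).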
The GVI framework generalises this by allowing three free ingredients in the inference procedure compared to classic Bayesian (or variational Bayesian) inference:
- loss function \(\ell\), as in Gibbs posteriors
- a divergence function \(D\) (which doesn’t have to be the KL divergence)
- variational family \(\mathcal{Q}\).
The optimisation objective is
\[ q^* = \arg\min_{q\in \mathcal{Q}} \left\{\mathbb{E}_q\biggl[\sum_{i=1}^n \ell(\theta,x_i)\biggr] + D(q\| \pi)\right\}. \]
In this setup, when \(D\) is the KL divergence, the loss is the negative log-likelihood (properly scaled), and \(\mathcal{Q}\) is the set of all probability measures \(\mathcal{P}(\Theta)\), the classical Bayesian posterior is recovered.
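A toy instance of the objective, under assumptions of my own choosing: a unit-variance Gaussian likelihood for a mean \(\theta\), a \(N(0, \tau^2)\) prior, and a Gaussian variational family \(N(m, s^2)\). In this setting the expected loss and both divergences have closed forms, so a brute-force grid search is enough. The squared 2-Wasserstein distance is my stand-in for "some non-KL \(D\)", not a choice taken from the paper, and every function name is hypothetical.

```python
import numpy as np

def expected_loss(m, s, x):
    """E_q[sum_i 0.5 * (x_i - theta)^2] for q = N(m, s^2), in closed form."""
    n = x.size
    return 0.5 * (np.sum(x**2) - 2.0 * m * np.sum(x) + n * m**2) + 0.5 * n * s**2

def kl_to_prior(m, s, tau):
    """KL(N(m, s^2) || N(0, tau^2))."""
    return np.log(tau / s) + (s**2 + m**2) / (2.0 * tau**2) - 0.5

def w2_sq_to_prior(m, s, tau):
    """Squared 2-Wasserstein distance between N(m, s^2) and N(0, tau^2)."""
    return m**2 + (s - tau) ** 2

def gvi_fit(x, divergence, tau, m_grid, s_grid):
    """Minimise expected loss + divergence over a grid of (m, s) values."""
    M, S = np.meshgrid(m_grid, s_grid, indexing="ij")
    objective = expected_loss(M, S, x) + divergence(M, S, tau)
    i, j = np.unravel_index(np.argmin(objective), objective.shape)
    return float(m_grid[i]), float(s_grid[j])

rng = np.random.default_rng(2)
n, tau = 20, 2.0
x = rng.normal(1.0, 1.0, size=n)

m_grid = np.linspace(-1.0, 3.0, 2001)  # step 0.002
s_grid = np.linspace(0.01, 1.0, 991)   # step 0.001

m_kl, s_kl = gvi_fit(x, kl_to_prior, tau, m_grid, s_grid)
m_w2, s_w2 = gvi_fit(x, w2_sq_to_prior, tau, m_grid, s_grid)

# Exact conjugate posterior for comparison with the KL case.
precision = n + 1.0 / tau**2
exact_mean, exact_sd = float(x.sum() / precision), precision**-0.5

print(f"exact posterior: mean={exact_mean:.3f}, sd={exact_sd:.3f}")
print(f"GVI with KL:     mean={m_kl:.3f}, sd={s_kl:.3f}")
print(f"GVI with W2^2:   mean={m_w2:.3f}, sd={s_w2:.3f}")
```

With the KL divergence, the fit matches the exact posterior up to grid resolution. That is the best this family can do, because the exact posterior is itself Gaussian here. Swapping in the Wasserstein term changes both the location and the spread, which is the point: \(D\) controls how the prior regularises the answer.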
5 References
Footnotes
[1] A name lab-grown to irritate me. As noted already, I reject naming things “Generalized” and I also think that “variational inference” as statisticians use it is a misnomer. I acknowledge I will not win this naming fight, but that does not mean I will not make a giant fuss about it in order to diminish my pain by spreading it around.
