“Generalized” Bayesian inference

Approximating the Gibbs posterior

2024-09-26 — 2025-05-30

Wherein the Limitations of Kullback-Leibler Divergence Under Model Misspecification Are Examined, and Alternative Divergence Measures Are Proposed as Replacements in Bayesian Updating.

Bayes

estimator distribution

functional analysis

Markov processes

Monte Carlo

neural nets

optimization

probabilistic algorithms

probability

SDEs

stochastic processes

1 ‘Generalized’

I dislike naming things “Generalized”, for all the obvious reasons. If biologists named Eukaryotes Generalized Prokaryotes they would be mocked. You cannot do this in the rest of science, but in machine learning, somehow it is normal, and so you get naming abominations like Generalized Generalized Models.

This naming contention will continue, and I will continue to hate it. So it goes.

“Generalized Bayesian inference” implies a certain flavour of approximation to a certain kind of relaxation of traditional Bayesian inference. Compare and contrast with “Approximate Bayesian inference”, which uses different approximations and relaxations, and is equally damnably named. Which is which, and why, is best demonstrated by example.

We follow the reasoning in Jewson, Smith, and Holmes (2018).

In the M-closed scenario — where the true data-generating process lies within the model class — Bayesian updating and maximum likelihood estimation are justified, because they minimise the KL divergence between the true distribution and the model.

But when the models are mis-specified, KL divergence — the mathematical backbone of these methods — isn’t such a slam dunk. A high-speed tour follows.

The Kullback-Leibler divergence between distributions P and Q is defined as:

\[D_{KL}(P \parallel Q) = \int p(x) \log \frac{p(x)}{q(x)} dx\]

In the well-specified case (“M-closed”), this divergence is exactly what maximum likelihood estimation minimises. When we maximise the likelihood:

\[\hat{\theta}_{MLE} = \arg\max_\theta \sum_{i=1}^n \log p(x_i | \theta)\]

We’re implicitly minimising \(D_{KL}(P_{true} \parallel P_\theta)\) where \(P_{true}\) is the true data-generating distribution. This is why MLE is “optimal”—it finds the model closest to reality in KL terms.
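To see why, here is the standard one-line argument, sketched under the assumption of i.i.d. data and enough regularity for a law of large numbers to hold:

\[D_{KL}(P_{true} \parallel P_\theta) = \mathbb{E}_{P_{true}}[\log p_{true}(x)] - \mathbb{E}_{P_{true}}[\log p(x | \theta)], \qquad \frac{1}{n}\sum_{i=1}^n \log p(x_i | \theta) \to \mathbb{E}_{P_{true}}[\log p(x | \theta)]\]

The first expectation does not depend on \(\theta\), so maximising the average log-likelihood is, in the large-sample limit, the same as minimising the KL divergence.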

But here’s the problem: what if \(P_{true}\) isn’t in our model class at all? This is the M-open world. And, in fact, no model ever includes the truth; if we ignore that fact and pretend, weird stuff can happen (Shalizi 2009).

White (1982) showed that under misspecification, MLE still converges to something meaningful—the parameter that minimises KL divergence between the true distribution and our model class:

\[\theta^* = \arg\min_\theta D_{KL}(P_{true} \parallel P_\theta)\]
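A small simulation makes the pseudo-true parameter concrete. This is a minimal sketch under assumptions of my own choosing (not from White's paper): the truth is an Exponential(1) distribution and the model class is Gaussian, for which the KL-minimising parameters simply match the mean and variance of the truth.

```python
import random

random.seed(0)

# Assumed truth: Exponential(rate=1), which is not in the Gaussian family.
# Its mean is 1 and its variance is 1.
n = 50_000
x = [random.expovariate(1.0) for _ in range(n)]

# The Gaussian MLE has a closed form: sample mean and (biased) sample variance.
mu_hat = sum(x) / n
var_hat = sum((xi - mu_hat) ** 2 for xi in x) / n

# For a Gaussian model class the KL minimiser matches the first two moments
# of the truth, so the pseudo-true parameters are mu* = 1 and sigma*^2 = 1.
print(f"mu_hat={mu_hat:.3f}  var_hat={var_hat:.3f}")
```

The estimates settle near the pseudo-true values, yet the fitted Gaussian still puts a sizeable chunk of its mass on negative numbers that the truth never produces, which is the sense in which "closest in KL" can be cold comfort.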

This was reassuring… is that reassuring? It is at best mildly reassuring: small KL discrepancy can still be pretty bad in terms of making terrible decisions.

Bissiri, Holmes, and Walker (2016) asked: why restrict ourselves to KL divergence at all? They developed a general Bayesian framework using arbitrary divergence measures \(d(\cdot, \cdot)\):

\[\pi(\theta | data) \propto \pi(\theta) \exp(-\eta \cdot d(data, model(\theta)))\]

where \(\eta\) controls the learning rate. When \(d\) is the negative log-likelihood and \(\eta = 1\), we recover standard Bayesian updating. But we can choose other divergences based on what aspects of the data matter most for our specific problem.
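The update rule can be checked numerically on a grid. The following is a sketch under assumptions I have picked for illustration: a scalar \(\theta\), toy Gaussian data, a \(N(0, 10^2)\) prior, and \(d\) taken to be the cumulative negative log-likelihood of a unit-variance Gaussian. The function and variable names are hypothetical.

```python
import math
import random

random.seed(1)
data = [random.gauss(2.0, 1.0) for _ in range(20)]  # toy data (hypothetical)

def general_bayes_on_grid(grid, log_prior, loss, eta):
    """Weights proportional to prior(theta) * exp(-eta * sum_i loss(theta, x_i))."""
    log_w = [log_prior(t) - eta * sum(loss(t, x) for x in data) for t in grid]
    top = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def mean_and_var(grid, weights):
    mean = sum(t * p for t, p in zip(grid, weights))
    var = sum((t - mean) ** 2 * p for t, p in zip(grid, weights))
    return mean, var

grid = [-5.0 + 0.01 * k for k in range(1501)]  # theta in [-5, 10]
log_prior = lambda t: -0.5 * t * t / 100.0     # N(0, 10^2), up to a constant
nll = lambda t, x: 0.5 * (x - t) ** 2          # N(theta, 1) NLL, up to a constant

n = len(data)
exact_mean = sum(data) / (n + 0.01)            # conjugate posterior mean
exact_var = 1.0 / (n + 0.01)                   # conjugate posterior variance

grid_mean, grid_var = mean_and_var(
    grid, general_bayes_on_grid(grid, log_prior, nll, eta=1.0))
_, tempered_var = mean_and_var(
    grid, general_bayes_on_grid(grid, log_prior, nll, eta=0.5))

print(grid_mean, exact_mean)
print(grid_var, exact_var, tempered_var)
```

With `eta=1.0` the grid result agrees with the conjugate Gaussian posterior; halving `eta` discounts the data and roughly doubles the posterior variance, which is what "learning rate" means here.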

🚧TODO🚧: this looks a lot like Zellner’s argument (Zellner 1988) used in Bayes-by-backprop.

Multiple research groups (Ghosh and Basu 2018; Hooker and Vidyashankar 2014) developed “minimum divergence estimation” approaches, minimising other divergences:

Squared Hellinger distance: \(H^2(P,Q) = \frac{1}{2}\int (\sqrt{p(x)} - \sqrt{q(x)})^2 dx\)
Total variation: \(TV(P,Q) = \frac{1}{2}\int |p(x) - q(x)| dx\)
α-divergences: \(D_\alpha(P \parallel Q) = \frac{1}{\alpha(\alpha-1)} \int p(x) \left[ \left(\frac{p(x)}{q(x)}\right)^{\alpha-1} - 1 \right] dx\)
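To get a feel for how these quantities behave, here is a rough numerical sketch that evaluates the three formulas above, plus KL for reference, by a Riemann sum. The two Gaussians and the integration grid are arbitrary choices of mine for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

dx = 0.001
xs = [-15.0 + dx * k for k in range(30001)]   # grid on [-15, 15]
p = [normal_pdf(x, 0.0, 1.0) for x in xs]     # P = N(0, 1)
q = [normal_pdf(x, 1.0, 1.5) for x in xs]     # Q = N(1, 1.5^2)

def integrate(values):
    # Plain Riemann sum; the tails beyond the grid are negligible here.
    return sum(values) * dx

# 0.5 * integral of (sqrt(p) - sqrt(q))^2, as in the Hellinger formula above
hellinger_sq = 0.5 * integrate([(math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)])

# Total variation: 0.5 * integral of |p - q|
tv = 0.5 * integrate([abs(a - b) for a, b in zip(p, q)])

def alpha_div(alpha):
    # alpha-divergence in the parameterisation used above (alpha not in {0, 1})
    inner = integrate([a * ((a / b) ** (alpha - 1) - 1) for a, b in zip(p, q)])
    return inner / (alpha * (alpha - 1))

kl = integrate([a * math.log(a / b) for a, b in zip(p, q)])

print("Hellinger integral:", hellinger_sq)
print("TV:", tv)
print("alpha=0.5:", alpha_div(0.5), " alpha=1.001:", alpha_div(1.001), " KL:", kl)
```

Two sanity checks fall out of the formulas themselves: at \(\alpha = 1/2\) the α-divergence equals four times the Hellinger integral, and as \(\alpha \to 1\) it approaches the KL divergence.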

Each targets different features of the data and offers different robustness properties under misspecification.

No single divergence dominates (although optimal transport is a strong contender), and the main insight is that the choice of divergence should match your inferential or decision-theoretic goals.

So, how do we practically do inference that minimizes these divergences?

2 Gibbs Posterior

The simplest relaxation of classic Bayes is to replace the negative log-likelihood with a general loss function; the result is called a Gibbs posterior.
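As a toy illustration of why one might bother, the sketch below compares two Gibbs posteriors for a location parameter on contaminated data: one built from squared error (equivalent to a unit-variance Gaussian likelihood) and one from absolute error. Everything here is an assumption made for the example: the data, the flat prior on a grid, and a learning rate of 1.

```python
import math
import random

random.seed(2)
# 45 inliers around 0 plus 5 gross outliers at 20 (hypothetical data).
data = [random.gauss(0.0, 1.0) for _ in range(45)] + [20.0] * 5

grid = [-5.0 + 0.01 * k for k in range(1001)]  # theta in [-5, 5]

def gibbs_posterior_mean(loss, learning_rate=1.0):
    # Flat prior on the grid, so weights are exp(-learning_rate * cumulative loss).
    log_w = [-learning_rate * sum(loss(t, x) for x in data) for t in grid]
    top = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return sum(t * v for t, v in zip(grid, w)) / total

mean_sq = gibbs_posterior_mean(lambda t, x: 0.5 * (x - t) ** 2)
mean_abs = gibbs_posterior_mean(lambda t, x: abs(x - t))

print("squared-error Gibbs posterior mean: ", mean_sq)
print("absolute-error Gibbs posterior mean:", mean_abs)
```

The squared-error posterior is dragged toward the outliers because it concentrates around the sample mean, while the absolute-error posterior concentrates around the sample median and stays near the bulk of the data.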

3 Generalized Bayesian Computation

I just saw a presentation on Dellaporta et al. (2022), which stakes a claim to the term “Generalized Bayesian Computation”. She mixes the bootstrap, Bayesian nonparametrics, MMD, and simulation-based inference in an M-open setting. I’m not sure which of the results are specific to that (impressive) paper, but Dellaporta name-checks Fong, Lyddon, and Holmes (2019), Lyddon, Walker, and Holmes (2018), Matsubara et al. (2022), Pacchiardi and Dutta (2022), and Schmon, Cannon, and Knoblauch (2021).

There’s some interesting stuff happening in that group. Maybe this introductory post will be a good start: Generalising Bayesian Inference.

4 Generalized Variational Bayes Inference

Is this different from the above? I’m not even sure. The explicit variational approximation is hard to see in Dellaporta et al. (2022), whereas it is more obvious in Knoblauch, Jewson, and Damoulas (2022), so let’s claim they are different for now.

Knoblauch, Jewson, and Damoulas (2022) call a variational approximation to the Gibbs posterior using a non-KL divergence a Generalized Variational Inference (GVI) method.¹

The argument is that we can interpret the solution to the Robust Bayesian Inference problem variationally. Recall the empirical risk, i.e. the mean loss over the data:

\[ R_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(\theta, x_i) \]

which defines the Gibbs posterior measure as

\[ \pi_n(\theta) \propto \exp\{-\omega\, n\, R_n(\theta)\}\,\pi(\theta), \]

where \(\omega > 0\) is a learning rate playing the role of \(\eta\) above. They argue that this is equivalent to solving an optimization problem over probability measures \(q(\theta)\) of the form

\[ q^* = \arg\min_{q \in \mathcal{P}(\Theta)} \left\{\omega\, n\, \mathbb{E}_q\bigl[R_n(\theta)\bigr] + \mathrm{KL}(q\| \pi)\right\}. \]
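The equivalence follows from a short rearrangement. As a sketch, assuming the normalising constant \(Z_n = \int \exp\{-\omega\, n\, R_n(\theta)\}\,\pi(\theta)\, d\theta\) is finite and that \(q\) has a density:

\[ \omega\, n\, \mathbb{E}_q\bigl[R_n(\theta)\bigr] + \mathrm{KL}(q\| \pi) = \mathbb{E}_q\left[\log \frac{q(\theta)}{\pi(\theta)\exp\{-\omega\, n\, R_n(\theta)\}}\right] = \mathrm{KL}(q\| \pi_n) - \log Z_n. \]

Since \(\log Z_n\) does not depend on \(q\), and \(\mathrm{KL}(q\| \pi_n)\) is non-negative and zero only at \(q = \pi_n\), the minimiser is the Gibbs posterior itself.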

The GVI framework generalises this by allowing three free ingredients in the inference procedure compared to classic Bayesian (or variational Bayesian) inference:

a loss function \(\ell\), as in Gibbs posteriors
a divergence \(D\) (which doesn’t have to be the KL divergence)
a variational family \(\mathcal{Q}\).

The optimization objective is

\[ q^* = \arg\min_{q\in \mathcal{Q}} \left\{\mathbb{E}_q\biggl[\sum_{i=1}^n \ell(\theta,x_i)\biggr] + D(q\| \pi)\right\}. \]

In this setup, when \(D\) is the KL divergence, the loss is the negative log-likelihood (properly scaled), and \(\mathcal{Q}\) contains all probability measures on \(\Theta\), the classical Bayesian posterior is recovered.
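Here is a deliberately tiny sketch of the three ingredients in action. All specifics are my own assumptions for illustration, not the paper's experiments: a scalar parameter, five made-up observations, squared-error loss (a unit-variance Gaussian negative log-likelihood up to a constant), a \(N(0, 1)\) prior, a Gaussian variational family \(N(m, s^2)\), and brute-force grid search in place of a real optimiser. The alternative divergence is Rényi's α-divergence, chosen only because it has a closed form between Gaussians.

```python
import math

data = [1.2, 2.9, 1.7, 2.4, 1.8]  # toy observations (hypothetical)
n, sum_x = len(data), sum(data)
sum_x2 = sum(x * x for x in data)
tau = 1.0                          # prior is N(0, tau^2)

def expected_loss(m, s):
    # E_q[sum_i 0.5 * (x_i - theta)^2] for q = N(m, s^2), in closed form.
    return 0.5 * (sum_x2 - 2 * m * sum_x + n * (m * m + s * s))

def kl(m, s):
    # KL(N(m, s^2) || N(0, tau^2))
    return math.log(tau / s) + (s * s + m * m) / (2 * tau * tau) - 0.5

def renyi(m, s, alpha=0.5):
    # Renyi alpha-divergence between the same two Gaussians (closed form).
    var_a = alpha * tau * tau + (1 - alpha) * s * s
    return (math.log(tau / s)
            + math.log(tau * tau / var_a) / (2 * (alpha - 1))
            + alpha * m * m / (2 * var_a))

def gvi(divergence):
    # Brute-force search over the variational family Q = {N(m, s^2)}.
    best = None
    for i in range(401):           # m in [0, 4]
        m = 0.01 * i
        for j in range(10, 301):   # s in [0.05, 1.5]
            s = 0.005 * j
            value = expected_loss(m, s) + divergence(m, s)
            if best is None or value < best[0]:
                best = (value, m, s)
    return best[1], best[2]

m_kl, s_kl = gvi(kl)
m_re, s_re = gvi(renyi)

print("KL:   ", m_kl, s_kl)
print("Renyi:", m_re, s_re)
print("exact conjugate posterior:", sum_x / (n + 1), math.sqrt(1 / (n + 1)))
```

With the KL divergence the search lands, up to grid resolution, on the exact conjugate posterior, because that posterior is itself Gaussian and so lies in the variational family. Swapping in the Rényi divergence changes the answer while leaving the loss and the family untouched, which is the kind of knob GVI exposes.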

5 References

Bissiri, Holmes, and Walker. 2016. “A General Framework for Updating Belief Distributions.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Csaba, and Sz. 2023. “Learning with Misspecified Models.”

Dellaporta, Knoblauch, Damoulas, et al. 2022. “Robust Bayesian Inference for Simulator-Based Models via the MMD Posterior Bootstrap.” arXiv:2202.04744 [Cs, Stat].

Fong, Lyddon, and Holmes. 2019. “Scalable Nonparametric Sampling from Multimodal Posteriors with the Posterior Bootstrap.” arXiv:1902.03175 [Cs, Stat].

Galvani, Bardelli, Figini, et al. 2021. “A Bayesian Nonparametric Learning Approach to Ensemble Models Using the Proper Bayesian Bootstrap.” Algorithms.

Ghosh, and Basu. 2018. “A New Family of Divergences Originating From Model Adequacy Tests and Application to Robust Statistical Inference.” IEEE Trans. Inf. Theor.

Grendár, and Judge. 2012. “Not All Empirical Divergence Minimizing Statistical Methods Are Created Equal?” AIP Conference Proceedings.

Grünwald, and Dawid. 2004. “Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory.” The Annals of Statistics.

Hooker, and Vidyashankar. 2014. “Bayesian Model Robustness via Disparities.” TEST.

Jewson, Smith, and Holmes. 2018. “Principles of Bayesian Inference Using General Divergence Criteria.” Entropy.

Knoblauch, Jewson, and Damoulas. 2019. “Generalized Variational Inference: Three Arguments for Deriving New Posteriors.”

———. 2022. “An Optimization-Centric View on Bayes’ Rule: Reviewing and Generalizing Variational Inference.” Journal of Machine Learning Research.

Lyddon, Walker, and Holmes. 2018. “Nonparametric Learning from Bayesian Models with Randomized Objective Functions.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.

Matsubara, Knoblauch, Briol, et al. 2022. “Robust Generalised Bayesian Inference for Intractable Likelihoods.” Journal of the Royal Statistical Society Series B: Statistical Methodology.

Pacchiardi, and Dutta. 2022. “Generalized Bayesian Likelihood-Free Inference Using Scoring Rules Estimators.” arXiv:2104.03889 [Stat].

Schmon, Cannon, and Knoblauch. 2021. “Generalized Posteriors in Approximate Bayesian Computation.” arXiv:2011.08644 [Stat].

Shalizi. 2009. “Dynamics of Bayesian Updating with Dependent Data and Misspecified Models.” Electronic Journal of Statistics.

White. 1982. “Maximum Likelihood Estimation of Misspecified Models.” Econometrica.

Zellner. 1988. “Optimal Information Processing and Bayes’s Theorem.” The American Statistician.

———. 2002. “Information Processing and Bayesian Analysis.” Journal of Econometrics, Information and Entropy Econometrics.

Footnotes

¹ A name lab-grown to irritate me. As noted already, I reject naming things “Generalized” and I also think that “variational inference” as statisticians use it is a misnomer. I acknowledge I will not win this naming fight, but that does not mean I will not make a giant fuss about it in order to diminish my pain by spreading it around.