1 ‘Generalized’
I dislike naming things “Generalized”, for all the obvious reasons. If biologists named Eukaryotes Generalized Prokaryotes they would be mocked. You cannot do this in the rest of the world, but in machine learning, somehow it is normal, and so you get naming abominations like Generalized Generalized Models.
This naming contention will continue, and I will continue to hate it. So it goes.
“Generalized Bayesian inference” implies a certain flavour of approximation to a certain kind of relaxation of traditional Bayesian inference. Compare and contrast to “Approximate Bayesian inference”, which uses different approximations and relaxations, and is equally damnably named. Which is which, and why, is best demonstrated by example.
We follow the reasoning in Jewson, Smith, and Holmes (2018).
In the M-closed scenario (where the true data-generating process is within the model class), Bayesian updating and maximum likelihood estimation are justified because they minimize the KL divergence between the true distribution and the model.
But! When the model is mis-specified, KL divergence, the mathematical backbone of these methods, is not such a slam dunk. A high-speed tour follows.
The Kullback-Leibler divergence between distributions $P$ and $Q$ is defined as

$$\operatorname{KL}(P \,\|\, Q) = \mathbb{E}_{P}\left[ \log \frac{p(x)}{q(x)} \right] = \int p(x) \log \frac{p(x)}{q(x)} \,\mathrm{d}x.$$
In the well-specified case (“M-closed”), this divergence is exactly what maximum likelihood estimation minimizes. When we maximize the likelihood,

$$\hat{\theta}_{\mathrm{MLE}} = \operatorname*{arg\,max}_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \log q_{\theta}(x_i),$$

we’re implicitly minimizing

$$\operatorname{KL}(P \,\|\, Q_{\theta}) = \mathbb{E}_{P}[\log p(X)] - \mathbb{E}_{P}[\log q_{\theta}(X)],$$

since the first term doesn’t depend on $\theta$ and the second is exactly what the average log-likelihood estimates.
But here’s the problem: what if $P \notin \{Q_{\theta} : \theta \in \Theta\}$, i.e. no parameter setting reproduces the true data-generating process?
White (1982) showed that under misspecification, MLE still converges to something meaningful: the pseudo-true parameter that minimizes the KL divergence between the true distribution and our model class,

$$\theta^{*} = \operatorname*{arg\,min}_{\theta \in \Theta} \operatorname{KL}(P \,\|\, Q_{\theta}).$$
This was reassuring… or is it? A small KL discrepancy can still lead to terrible decisions.
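For a toy numerical check of the pseudo-true-parameter idea (the Exponential(1) data and the Gaussian location model below are invented purely for illustration), we can compare the MLE of a deliberately misspecified model to a brute-force KL minimiser:

```python
# Toy check of White's pseudo-true parameter: the specific choices here
# (Exponential(1) data, a N(mu, 1) model) are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)  # true P: Exponential(1)

# MLE of mu in the (misspecified) N(mu, 1) model is the sample mean.
mu_mle = x.mean()

# KL(P || N(mu, 1)) depends on mu only through E_P[(X - mu)^2] / 2,
# so estimate that term by Monte Carlo on a grid and take the argmin.
mus = np.linspace(0.0, 2.0, 401)
kl_term = np.array([np.mean((x - m) ** 2) / 2 for m in mus])
mu_kl_min = mus[kl_term.argmin()]

print(f"MLE: {mu_mle:.3f}, KL minimiser: {mu_kl_min:.3f}")  # both near E_P[X] = 1
```

Both land on $\mathbb{E}_{P}[X] = 1$: under misspecification the MLE chases the KL-closest member of the model class, for better or worse.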
Bissiri, Holmes, and Walker (2016) asked: why restrict ourselves to KL divergence at all? They developed a general Bayesian framework in which the log-likelihood is replaced by an arbitrary loss function, which can in turn be chosen to target divergences other than KL. The update is

$$\pi(\theta \mid x_{1:n}) \propto \exp\left\{ -w \sum_{i=1}^{n} \ell(\theta, x_i) \right\} \pi(\theta),$$

where $\pi(\theta)$ is the prior, $\ell(\theta, x)$ is the loss, and $w > 0$ is a learning rate that calibrates how strongly the data update the prior.
TODO: this looks a lot like Zellner’s argument (Zellner 1988) used in Bayes-by-backprop.
Multiple research groups (Ghosh and Basu 2018; Hooker and Vidyashankar 2014) developed “minimum divergence estimation” approaches, minimising other divergences:
- Hellinger distance: $H^{2}(P, Q) = \tfrac{1}{2} \int \bigl( \sqrt{p(x)} - \sqrt{q(x)} \bigr)^{2} \,\mathrm{d}x$
- Total variation: $\operatorname{TV}(P, Q) = \tfrac{1}{2} \int \lvert p(x) - q(x) \rvert \,\mathrm{d}x$
- α-divergences (in one common parameterisation): $D_{\alpha}(P \,\|\, Q) = \frac{1}{\alpha(\alpha - 1)} \left( \int p(x)^{\alpha} q(x)^{1 - \alpha} \,\mathrm{d}x - 1 \right)$
Each targets different features of the data and offers different robustness properties under misspecification.
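For concreteness, here is a minimal numerical sketch of those three divergences for densities discretised on a grid; the particular pair of distributions (a standard normal versus a Student-t) is an arbitrary choice of mine, not anything from the papers above.

```python
# Minimal sketch: the three divergences above, for densities discretised
# on a grid. The particular P and Q are arbitrary illustrations.
import numpy as np
from scipy import stats

xs = np.linspace(-10, 10, 4001)
dx = xs[1] - xs[0]
p = stats.norm(0, 1).pdf(xs)   # "true" P
q = stats.t(df=3).pdf(xs)      # model Q with heavier tails

def hellinger_sq(p, q, dx):
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx

def total_variation(p, q, dx):
    return 0.5 * np.sum(np.abs(p - q)) * dx

def alpha_divergence(p, q, dx, alpha=0.5):
    # At alpha = 0.5 this is 4x the squared Hellinger distance.
    return (np.sum(p**alpha * q**(1 - alpha)) * dx - 1) / (alpha * (alpha - 1))

print(hellinger_sq(p, q, dx), total_variation(p, q, dx), alpha_divergence(p, q, dx))
```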
Nothing dominates (although actually optimal transport is a strong contender), and the main insight is that the choice of divergence should match your inferential or decision-theoretic goals.
So, how do we practically do inference that minimizes these divergences?
2 Gibbs Posterior
The simplest relaxation of classic Bayes is to replace the negative log-likelihood with a loss function $\ell(\theta, x)$; the resulting update,

$$\pi_{\ell}(\theta \mid x_{1:n}) \propto \exp\left\{ -w \sum_{i=1}^{n} \ell(\theta, x_i) \right\} \pi(\theta),$$

is called a Gibbs posterior.
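Here is a minimal sketch of such a posterior computed on a grid, assuming a 1-D location parameter, a Huber loss as a stand-in robust loss, a wide Gaussian prior, and an arbitrarily fixed learning rate $w$; none of these specific choices come from the papers cited here.

```python
# Minimal Gibbs-posterior sketch on a grid. The Huber loss, the N(0, 10^2)
# prior, the contaminated data, and w = 1 are all illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
# Contaminated data: mostly N(0, 1), with a few gross outliers.
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(20.0, 1.0, 5)])

theta = np.linspace(-5, 25, 2001)           # parameter grid
dt = theta[1] - theta[0]
log_prior = -0.5 * (theta / 10.0) ** 2      # N(0, 10^2) prior, up to a constant

def huber(r, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

w = 1.0   # learning rate; calibrating it is its own literature
loss = huber(x[None, :] - theta[:, None]).sum(axis=1)   # sum_i l(theta, x_i)

log_post = log_prior - w * loss
log_post -= log_post.max()                  # stabilise before exponentiating
post = np.exp(log_post)
post /= post.sum() * dt                     # normalise on the grid

print("Gibbs posterior mean:", (theta * post).sum() * dt)
```

Swapping `huber` for the squared error `0.5 * r**2` recovers, up to the prior, an ordinary Gaussian-likelihood posterior whose mean gets dragged toward the outliers; the Huber Gibbs posterior mostly ignores them.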
3 Generalized Bayesian Computation
I just saw a presentation on Dellaporta et al. (2022) which stakes a claim to the term “Generalized Bayesian Computation”. She mixes bootstrap, Bayes nonparametrics, MMD, and simulation-based inference in an M-open setting. I’m not sure which of the results are specific to that (impressive) paper, but Dellaporta name-checks Fong, Lyddon, and Holmes (2019), Lyddon, Walker, and Holmes (2018), Matsubara et al. (2022), Pacchiardi and Dutta (2022), Schmon, Cannon, and Knoblauch (2021).
There’s some interesting stuff happening in that group. Maybe this introductory post will be a good start: Generalising Bayesian Inference.
4 Generalized Variational Bayes Inference
Is this different from the above? I’m not even sure. The explicit variational approximation is hard to see in Dellaporta et al. (2022), whereas it is more obvious in Knoblauch, Jewson, and Damoulas (2022), so let’s claim they are different for now.
Knoblauch, Jewson, and Damoulas (2022) calls a variational approximation to the Gibbs posterior that uses a non-KL divergence a Generalized Variational Inference (GVI) method.1
The argument is that we can interpret the solution to the Robust Bayesian Inference problem variationally. We recall the average risk (the mean loss over the data),

$$R_{n}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i),$$

which (taking the learning rate $w = 1$ for simplicity) defines the Gibbs posterior measure as

$$\pi_{\ell}(\theta \mid x_{1:n}) \propto \exp\{ -n R_{n}(\theta) \} \, \pi(\theta).$$

They argue it is equivalent to solving an optimisation problem over probability measures,

$$\pi_{\ell}(\cdot \mid x_{1:n}) = \operatorname*{arg\,min}_{q \in \mathcal{P}(\Theta)} \Bigl\{ \mathbb{E}_{q(\theta)}\bigl[ n R_{n}(\theta) \bigr] + \operatorname{KL}(q \,\|\, \pi) \Bigr\}.$$
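To see why, note the standard completing-the-KL identity, with $Z_{n} = \int \exp\{-n R_{n}(\theta)\}\,\pi(\mathrm{d}\theta)$ the Gibbs normalising constant:

$$\mathbb{E}_{q(\theta)}\bigl[ n R_{n}(\theta) \bigr] + \operatorname{KL}(q \,\|\, \pi) = \mathbb{E}_{q(\theta)}\left[ \log \frac{q(\theta)}{\pi(\theta) \exp\{-n R_{n}(\theta)\}} \right] = \operatorname{KL}\bigl( q \,\|\, \pi_{\ell}(\cdot \mid x_{1:n}) \bigr) - \log Z_{n},$$

which is minimised, at value $-\log Z_{n}$, exactly by $q = \pi_{\ell}(\cdot \mid x_{1:n})$.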
The GVI framework generalizes this by allowing three free ingredients in the inference procedure compared to the classic Bayesian (or variational Bayesian):
- a loss function $\ell(\theta, x)$, as in Gibbs posteriors
- a divergence function $D(q \,\|\, \pi)$ (which doesn’t have to be the KL divergence)
- a variational family $\mathcal{Q}$.
The optimisation objective is

$$q^{*} = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \Bigl\{ \mathbb{E}_{q(\theta)}\Bigl[ \sum_{i=1}^{n} \ell(\theta, x_i) \Bigr] + D(q \,\|\, \pi) \Bigr\}.$$

In this setup, when $\ell(\theta, x) = -\log p(x \mid \theta)$, $D = \operatorname{KL}$, and $\mathcal{Q}$ is the set of all probability measures on $\Theta$, we recover exact Bayesian inference; keeping those choices but restricting $\mathcal{Q}$ gives back standard variational Bayes.
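To make the three ingredients concrete, here is a minimal sketch of the GVI objective on a 1-D grid, reusing the Huber-loss setup from the Gibbs-posterior sketch above, with a discretised Gaussian variational family and a Rényi divergence standing in for $D$; all the specific choices (data, prior, $\alpha$, optimiser) are my illustrations, not anything prescribed by the paper.

```python
# Minimal GVI sketch on a 1-D grid: E_q[sum_i l(theta, x_i)] + D(q || prior),
# with q a discretised Gaussian and D a Renyi divergence. All specific
# choices here (data, prior, alpha, optimiser) are illustrative.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(20.0, 1.0, 5)])

theta = np.linspace(-5, 25, 2001)

def huber(r, delta=1.0):
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

loss = huber(x[None, :] - theta[:, None]).sum(axis=1)   # sum_i l(theta, x_i)

prior = stats.norm(0, 10).pdf(theta)
prior /= prior.sum()                                    # prior as grid weights

def renyi(q, p, alpha=0.5):
    """Renyi divergence D_alpha(q || p) between distributions on the grid."""
    return np.log(np.sum(q**alpha * p**(1 - alpha))) / (alpha - 1)

def gvi_objective(params, D=renyi):
    m, log_s = params                         # variational family: N(m, s^2), discretised
    q = stats.norm(m, np.exp(log_s)).pdf(theta)
    q /= q.sum()
    return np.sum(q * loss) + D(q, prior)     # E_q[loss] + D(q || prior)

res = minimize(gvi_objective, x0=[5.0, 0.0], method="Nelder-Mead")
print("GVI mean and sd:", res.x[0], np.exp(res.x[1]))
```

Swapping the `D` argument for the KL divergence recovers an ordinary variational approximation to the Gibbs posterior; changing the divergence, the loss, or the family is just a matter of changing the corresponding ingredient.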
5 References
Footnotes
A name lab-grown to irritate me. As noted already, I reject naming things “Generalized” and I also think that “variational inference” as statisticians use it is a misnomer. I acknowledge I will not win this naming fight, but that does not mean I will not make a giant fuss about it in order to diminish my pain by spreading it around.↩︎