Imprecise Bayesianism
2025-11-30 — 2026-01-14
Wherein Imprecise Bayesianism Is Presented as an Alternative for M‑open Problems, Where Beliefs Are Represented by Convex Sets of Distributions and PAC‑Bayes Generalization Bounds Are Invoked.
In M-open Bayesian inference, we accept that our models are simplifications that don’t contain the true data-generating process, which leads to problems with standard Bayesian updating.
What alternative foundations or extensions of Bayesianism can better handle model misspecification?
1 Maximin updates over set priors
As far as I know, this is the classic approach. Informally: we have a set of prior distributions representing our beliefs. When we see data, we update each prior using Bayes’ rule to get a set of posteriors. When making decisions, we consider the worst-case expected utility across all posteriors in this set and choose the action that maximizes this worst-case utility.
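The recipe above can be sketched on a toy problem. This is a minimal illustration under assumptions of my own, not anything from the cited literature: a coin with unknown bias, a finite set of Beta priors standing in for the credal set, and three hypothetical betting actions whose utility is linear in the probability of heads (so the posterior mean is all we need).

```python
# Toy maximin decision over a set of priors.
# Assumptions (all hypothetical): Bernoulli data, Beta(a, b) priors,
# and a three-action betting problem with payoffs of +1 / -1 / 0.

PRIORS = [(1, 1), (8, 2), (2, 8)]  # Beta(a, b) parameters, one per prior in the set
ACTIONS = ("bet_heads", "bet_tails", "abstain")


def posterior_params(prior, heads, tails):
    """Conjugate Bayes update of a Beta(a, b) prior on Bernoulli data."""
    a, b = prior
    return (a + heads, b + tails)


def posterior_mean(params):
    """Posterior probability of heads under Beta(a, b)."""
    a, b = params
    return a / (a + b)


def expected_utility(action, p_heads):
    """A bet pays +1 if right and -1 if wrong; abstaining pays 0."""
    if action == "bet_heads":
        return p_heads - (1 - p_heads)
    if action == "bet_tails":
        return (1 - p_heads) - p_heads
    return 0.0


def maximin_action(priors, heads, tails, actions=ACTIONS):
    """Update every prior, then pick the action with the best worst case."""
    posteriors = [posterior_params(p, heads, tails) for p in priors]
    worst_case = {
        act: min(expected_utility(act, posterior_mean(q)) for q in posteriors)
        for act in actions
    }
    best = max(worst_case, key=worst_case.get)
    return best, worst_case


if __name__ == "__main__":
    print(maximin_action(PRIORS, heads=0, tails=0))
    print(maximin_action(PRIORS, heads=30, tails=10))
```

With no data the priors disagree enough that abstaining has the best worst case. After 30 heads and 10 tails every posterior in the set favours heads, so betting on heads wins. The set of posteriors shrinks only as fast as the most stubborn prior moves, which may be part of why this is awkward in practice.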
Easy to say, but I haven’t really used this myself and suspect it’s annoying in practice (Camerer and Weber 1992; De Bock 2020; Giustinelli, Manski, and Molinari 2021; Hayashi 2021; Walley 1991).
2 PAC-Bayes methods
There’s a large body of work on this; I’m not an expert (Catoni 2007; Haddouche and Guedj 2022; Rivasplata et al. 2020; Rodríguez-Gálvez, Thobaben, and Skoglund 2024; Sucker and Ochs 2023; Thiemann et al. 2017).
PAC-Bayes (Probably Approximately Correct Bayesian) methods offer a theoretically grounded way to aggregate misspecified models in the M-open setting, giving high-probability generalization bounds without assuming realizability. They originate with McAllester (1998, 1999) and were sharpened by Catoni (2007). The bounds control the expected risk of a posterior over hypotheses via the KL divergence to a prior, so we can, in principle, do robust model selection or ensembling even when the true data-generating process lies outside the model class \(\mathcal{M}\).
I’m unclear whether all stacking bounds are in fact of PAC-Bayes type.
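PAC-Bayes seems to justify techniques like stacking by quantifying how well a data-dependent posterior generalizes. For concreteness, a McAllester-style bound (as far as I can tell, Catoni’s own bounds take a different, temperature-parameterized form) states that for losses \(\ell\) bounded in \([0,1]\), \[\mathbb{E}_{h \sim Q}[\mathrm{risk}(h)] \leq \mathbb{E}_{h \sim Q}\Big[\frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)\Big] + \sqrt{\frac{\mathrm{KL}(Q\|P) + \log(2n/\delta)}{2n}},\] with probability at least \(1-\delta\) over an i.i.d. sample of size \(n\), simultaneously for every posterior \(Q\), provided the prior \(P\) is fixed before seeing the data. Nothing in the statement requires the data-generating process to lie in the model class, which is the sense in which it sidesteps M-closed assumptions, but don’t ask me for details.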
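To make the bound less abstract, here is a minimal sketch that evaluates its right-hand side for a finite hypothesis class, where \(P\) and \(Q\) are just weight vectors. It uses the square-root form with the \(\log(2n/\delta)\) constant as written above; published versions differ in the constants. The function names and the example numbers are my own, not taken from any PAC-Bayes library.

```python
import math


def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions given as equal-length lists.

    Assumes p[i] > 0 wherever q[i] > 0; terms with q[i] == 0 contribute 0.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def pac_bayes_bound(emp_risks, q, p, n, delta=0.05):
    """Right-hand side of the square-root bound quoted in the text.

    emp_risks[i] is the empirical risk of hypothesis i on n samples,
    with the loss assumed to lie in [0, 1]. The prior p must have been
    chosen without looking at those samples.
    """
    gibbs_emp_risk = sum(qi * ri for qi, ri in zip(q, emp_risks))
    complexity = math.sqrt((kl_divergence(q, p) + math.log(2 * n / delta)) / (2 * n))
    return gibbs_emp_risk + complexity


if __name__ == "__main__":
    risks = [0.10, 0.20, 0.30]       # hypothetical empirical risks of three models
    prior = [1 / 3, 1 / 3, 1 / 3]    # uniform prior over the three models
    n = 1000
    print(pac_bayes_bound(risks, prior, prior, n))            # Q = P, so KL = 0
    print(pac_bayes_bound(risks, [1.0, 0.0, 0.0], prior, n))  # all weight on model 0
```

Putting all the posterior weight on the empirically best model lowers the empirical term but pays \(\log 3\) in KL. Stacking-style weights can be read as a search for the \(Q\) that trades these two terms off best.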
Tools like paccube or pacbayes.py implement optimization of these bounds for neural nets and beyond.
3 Infrabayesianism
Infrabayesianism starts from the same assumption as M-open Bayes: the true state of the world is likely outside an agent’s hypothesis space. Kosoy and Appel (2020) replace the Bayesian prior, a single measure over environments, with a convex set of sub-probability measures (an infradistribution, or in the imprecise-probability vocabulary, a credal set of sa-measures). I presumed at first that this was just the standard set-prior approach above; it took me a while to see where it actually departs, and I’m not certain I have it.
They argue the framing is especially critical for embedded agents — agents that are part of an environment vastly more complex than they are. An AI can’t model every atom in its server room, let alone the universe, so its world model is necessarily incomplete. I confess I don’t follow that emphasis myself — I also can’t model every atom in anything I study, and I get by without infrabayesian reasoning. I should re-listen to Vanessa Kosoy’s interview on this theme; the framework is pitched at AI systems that must navigate deep and unavoidable uncertainty.
Instead of saying, “The probability of rain is 40%,” an infrabayesian agent might say, “The probability of rain is somewhere between 30% and 60%” — a convex set of distributions rather than a single one.
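For a binary event the interval statement translates directly into lower and upper expectations. The sketch below is illustrative only: the payoffs are made up, and it leans on the fact that an expectation is linear in the rain probability, so its extremes over an interval sit at the endpoints.

```python
def lower_upper_expectation(payoff_rain, payoff_dry, p_low=0.3, p_high=0.6):
    """Worst and best expected payoff when P(rain) is only known to lie
    in [p_low, p_high]. Linearity in p means checking the endpoints suffices."""
    values = [p * payoff_rain + (1 - p) * payoff_dry for p in (p_low, p_high)]
    return min(values), max(values)


if __name__ == "__main__":
    # Hypothetical payoffs for leaving the umbrella at home:
    # -10 if it rains, +2 if it stays dry.
    print(lower_upper_expectation(payoff_rain=-10, payoff_dry=2))
```

A precise Bayesian would report one number here. The imprecise agent reports the pair, and a cautious one acts on the lower value.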
The standard Bayesian setup assumes the true environment lives somewhere in the support of the agent’s prior, i.e. realizability. Drop that and ordinary Bayesian updating gives no guarantees: the posterior concentrates on whichever in-class hypothesis happens to fit the data least badly, and there is no reason to expect that hypothesis to be useful. Infrabayesianism drops realizability and substitutes a weaker fit-criterion built from convex sets. A hypothesis is an infradistribution \(\Theta\): a convex, closed set of sa-measures, where an sa-measure is a pair \((\mu, b)\) of a sub-probability measure \(\mu\) and a constant \(b \in \mathbb{R}\) recording “how much loss we’re willing to absorb on this branch.” That constant \(b\) is, as far as I can tell, what distinguishes an infradistribution from the plain credal sets above: the imprecise-probability machinery with loss-bookkeeping bolted on. The agent’s preferences are a Knightian-uncertainty maximin over \(\Theta\): evaluate each policy against the worst-case sa-measure in the set, with \(b\) entering as an additive constant, and prefer the policy whose worst case is best:
\[ V(\pi) = \inf_{(\mu,\, b) \in \Theta} \big( \mathbb{E}_\mu[U(\pi)] + b \big). \]
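As a sanity check on the formula, here is a minimal sketch of \(V(\pi)\) for a finite outcome space. It assumes \(\Theta\) is the convex hull of a handful of listed \((\mu, b)\) pairs; since the objective is affine in \((\mu, b)\), the infimum over the hull is attained at one of the listed points. It ignores whatever further closure conditions the actual framework imposes on \(\Theta\), and every number in it is invented for illustration.

```python
def infra_value(utilities, sa_measures):
    """V(pi) = min over (mu, b) of E_mu[U(pi)] + b.

    utilities:   dict outcome -> utility of this policy on that outcome
    sa_measures: list of (mu, b), where mu is a dict outcome -> mass with
                 total mass <= 1 (a sub-probability measure) and b is a constant
    """
    return min(
        sum(mass * utilities[outcome] for outcome, mass in mu.items()) + b
        for mu, b in sa_measures
    )


def best_policy(policy_utilities, sa_measures):
    """Pick the policy whose worst-case value is largest."""
    values = {name: infra_value(u, sa_measures) for name, u in policy_utilities.items()}
    return max(values, key=values.get), values


# Hypothetical generators of Theta. The third has total mass 0.9 and b = 0.05.
THETA = [
    ({"rain": 0.30, "dry": 0.70}, 0.0),
    ({"rain": 0.60, "dry": 0.40}, 0.0),
    ({"rain": 0.45, "dry": 0.45}, 0.05),
]

# Hypothetical utilities in [0, 1] for two policies.
POLICIES = {
    "umbrella": {"rain": 0.8, "dry": 0.6},
    "no_umbrella": {"rain": 0.0, "dry": 1.0},
}

if __name__ == "__main__":
    print(best_policy(POLICIES, THETA))
```

When every \(\mu\) has total mass one and every \(b\) is zero, this reduces to the plain worst-case expected utility over a credal set. The only new ingredients are the missing mass and the constant \(b\).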
Updating is the natural generalization — drop measures inconsistent with observation, renormalize the surviving set — and it admits a dynamic-consistency theorem analogous to the Bayesian one, so the agent’s worst-case promises stay coherent over time.
This construction buys two things at once. Non-realizability: the true environment need not coincide with any single \(\mu\); it only needs to be dominated by some worst case in the set, which is a strictly weaker assumption and the direct technical response to the incomplete-world-model problem above. Infrabayesian RL has regret bounds against non-realizable environments, which the classical Bayesian RL literature does not. Embeddedness: by letting the agent’s beliefs about its own future actions be imprecise rather than committed to a particular distribution, the construction sidesteps some of the diagonalization traps (5-and-10, spurious counterfactuals) that the classical embedded agency literature wrestles with.
What it gives us: the cleanest existing technical answer to non-realizability, and a candidate decision theory for embedded agents that does not route through Löbian fixed-point gymnastics.
What it does not seem to give us: anything about bounded compute. The maximin over \(\Theta\) is taken as a primitive, with no story about who runs it or how long it takes.
The work mostly lives on the Alignment Forum, in Kosoy’s Learning-Theoretic Agenda writeups, and in stringent learning-theory conferences outside my usual remit.
