Imprecise Bayesianism

2025-11-30 — 2026-01-14

Wherein Imprecise Bayesianism Is Presented as an Alternative for M‑open Problems, Where Beliefs Are Represented by Convex Sets of Distributions and PAC‑Bayes Generalization Bounds Are Invoked.


In M-open Bayesian inference, we accept that our models are simplifications that don’t contain the true data-generating process, which leads to problems with standard Bayesian updating.

What alternative foundations or extensions of Bayesianism can better handle model misspecification?

1 Maximin updates over set priors

As far as I know, this is the classic approach. Informally: we have a set of prior distributions representing our beliefs. When we see data, we update each prior using Bayes’ rule to get a set of posteriors. When making decisions, we consider the worst-case expected utility across all posteriors in this set and choose the action that maximizes this worst-case utility.
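To make the recipe concrete, here is a minimal sketch under assumptions of my own choosing: a coin-flip (Bernoulli) model, a small finite set of Beta priors standing in for the prior set, and a made-up three-action bet. The names `priors`, `actions`, and `maximin_action` are illustrative, not from any library. Because expected utility is linear in the heads-probability, each posterior enters only through its mean.

```python
# Hypothetical prior set: Beta(alpha, beta) priors on a coin's heads-probability.
priors = [(1.0, 1.0), (2.0, 8.0), (8.0, 2.0)]

# Hypothetical actions: name -> (utility if heads, utility if tails).
actions = {
    "bet_heads": (1.0, -1.0),
    "bet_tails": (-1.0, 1.0),
    "abstain": (0.0, 0.0),
}


def update(prior, heads, tails):
    """Conjugate Bayes update of one Beta prior on Bernoulli data."""
    alpha, beta = prior
    return (alpha + heads, beta + tails)


def posterior_mean(posterior):
    """Predictive probability of heads on the next flip."""
    alpha, beta = posterior
    return alpha / (alpha + beta)


def expected_utility(action, posterior):
    """Expected utility of an action under a single posterior."""
    p_heads = posterior_mean(posterior)
    u_heads, u_tails = actions[action]
    return p_heads * u_heads + (1.0 - p_heads) * u_tails


def maximin_action(posteriors):
    """Pick the action whose worst-case expected utility is largest."""
    worst = {
        name: min(expected_utility(name, q) for q in posteriors)
        for name in actions
    }
    best = max(worst, key=worst.get)
    return best, worst


# Update every prior in the set on the same data, then decide.
posteriors = [update(p, heads=3, tails=1) for p in priors]
best, worst = maximin_action(posteriors)
print(best, worst)
```

With only four flips the posteriors still disagree about which way the coin leans, so the worst case of either bet is negative and the maximin choice is to abstain; with ten times as much data in the same proportion, every posterior in this set favours heads and the bet becomes acceptable.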

Easy to say, but I haven’t really used this myself and suspect it’s annoying in practice (Camerer and Weber 1992; De Bock 2020; Giustinelli, Manski, and Molinari 2021; Hayashi 2021; Walley 1991).

2 PAC-Bayes methods

There’s a large body of work on this; I’m not an expert (Catoni 2007; Haddouche and Guedj 2022; Rivasplata et al. 2020; Rodríguez-Gálvez, Thobaben, and Skoglund 2024; Sucker and Ochs 2023; Thiemann et al. 2017).

PAC-Bayes (Probably Approximately Correct Bayesian) methods offer a theoretically grounded way to aggregate misspecified models in the M-open setting, giving high-probability generalization bounds without assuming realizability. They originate with McAllester (1998, 1999) and were sharpened by Catoni (2007). The bounds control the expected risk of a posterior over hypotheses in terms of its empirical risk plus a complexity term driven by the KL divergence to a prior, so we can, in principle, do robust model selection or ensembling even when the true data-generating process lies outside the model class \(\mathcal{M}\).

I’m unclear whether all stacking bounds are in fact of PAC-Bayes type.

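PAC-Bayes seems to justify techniques like stacking by quantifying how well a data-dependent posterior generalizes. For concreteness, a McAllester-style bound (the family that Catoni later sharpened) states that for losses \(\ell\) bounded in \([0,1]\), \[\mathbb{E}_{h\sim Q}[\mathrm{risk}(h)] \leq \mathbb{E}_{h\sim Q}\left[\frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)\right] + \sqrt{\frac{\mathrm{KL}(Q\|P) + \log(2n/\delta)}{2n}},\] with probability at least \(1-\delta\) over an i.i.d. sample of size \(n\), simultaneously for every posterior \(Q\), given a prior \(P\) fixed before seeing the data. Nothing in the statement requires any hypothesis to be the true data-generating process, so it sidesteps M-closed assumptions in some sense, but don’t ask me for details.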
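To see the trade-off in the bound above, here is a minimal sketch that evaluates its right-hand side for a finite hypothesis class: the \(Q\)-averaged empirical risk plus the square-root complexity term. Everything concrete is my assumption: the three empirical risks are invented, the prior is uniform, and the Gibbs posterior with inverse temperature `lam` is just one convenient data-dependent choice of \(Q\), not something the bound prescribes.

```python
import math


def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def gibbs_posterior(prior, emp_risks, lam):
    """Q(h) proportional to P(h) * exp(-lam * empirical_risk(h))."""
    log_weights = [math.log(p) - lam * r for p, r in zip(prior, emp_risks)]
    shift = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(w - shift) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def pac_bayes_bound(q, prior, emp_risks, n, delta):
    """Right-hand side of the square-root bound for losses in [0, 1]."""
    averaged_risk = sum(qi * r for qi, r in zip(q, emp_risks))
    complexity = kl_divergence(q, prior) + math.log(2 * n / delta)
    return averaged_risk + math.sqrt(complexity / (2 * n))


# Invented numbers: three hypotheses, their empirical risks on n = 1000 points.
emp_risks = [0.30, 0.25, 0.40]
prior = [1 / 3, 1 / 3, 1 / 3]
n, delta = 1000, 0.05

q = gibbs_posterior(prior, emp_risks, lam=50.0)
bound_prior = pac_bayes_bound(prior, prior, emp_risks, n, delta)
bound_gibbs = pac_bayes_bound(q, prior, emp_risks, n, delta)
print(q, bound_prior, bound_gibbs)
```

Moving \(Q\) away from \(P\) toward the best-fitting hypothesis lowers the averaged empirical risk but pays for it through the KL term; because the bound holds for all \(Q\) at once, we are allowed to pick \(Q\) after looking at the data.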

Tools like paccube or pacbayes.py implement optimization of these bounds for neural nets and beyond.

3 Infrabayesianism

Infrabayesianism starts from the same assumption as M-open Bayes: the true state of the world is likely outside an agent’s hypothesis space. Kosoy and Appel (2020) replace the Bayesian prior, a single measure over environments, with a convex set of sub-probability measures (an infradistribution or, in imprecise-probability vocabulary, a credal set of sa-measures). I presumed at first that this was just the standard set-prior approach above; it took me a while to see where it actually departs, and I’m not certain I have it.

They argue the framing is especially critical for embedded agents — agents that are part of an environment vastly more complex than they are. An AI can’t model every atom in its server room, let alone the universe, so its world model is necessarily incomplete. I confess I don’t follow that emphasis myself — I also can’t model every atom in anything I study, and I get by without infrabayesian reasoning. I should re-listen to Vanessa Kosoy’s interview on this theme; the framework is pitched at AI systems that must navigate deep and unavoidable uncertainty.

Instead of saying, “The probability of rain is 40%,” an infrabayesian agent might say, “The probability of rain is somewhere between 30% and 60%” — a convex set of distributions rather than a single one.

The standard Bayesian setup assumes the true environment lives somewhere in the support of the agent’s prior — realizability. Drop that and ordinary Bayesian updating gives no guarantees: the posterior concentrates on whichever in-class hypothesis happens to fit the data least badly, with no reason to be useful. Infra-Bayesianism drops realizability and substitutes a weaker fit-criterion built from convex sets. A hypothesis is an infradistribution \(\Theta\) — a convex, closed set of sa-measures, where an sa-measure is a pair \((\mu, b)\) of a sub-probability measure \(\mu\) and a constant \(b \in \mathbb{R}\) recording “how much loss we’re willing to absorb on this branch.” That constant \(b\) is, as far as I can tell, what distinguishes an infradistribution from the plain credal sets above — the imprecise-probability machinery with loss-bookkeeping bolted on. The agent’s preferences are a Knightian-uncertainty minimax over \(\Theta\) — evaluate each policy against the worst-case sa-measure in the set, with \(b\) entering as an additive constant:

\[ V(\pi) = \inf_{(\mu,\, b) \in \Theta} \big( \mathbb{E}_\mu[U(\pi)] + b \big). \]
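A toy evaluation of this formula, as a sketch under heavy simplifying assumptions: \(\Theta\) is represented by a finite list of \((\mu, b)\) pairs (standing in for the extreme points of the convex set, which suffices because the objective is affine in \((\mu, b)\)), outcomes are a two-element set, and each policy is summarized by its utility per outcome. The rain numbers and policy names are invented for illustration.

```python
def infra_value(utility, theta):
    """Worst case over (mu, b) in theta of E_mu[U] + b.

    utility: dict mapping outcome -> utility of the policy on that outcome.
    theta:   list of (mu, b), where mu maps outcome -> mass (total mass <= 1)
             and b is the additive constant attached to that measure.
    """
    return min(
        sum(mass * utility[outcome] for outcome, mass in mu.items()) + b
        for mu, b in theta
    )


# Invented hypothesis: two probability measures with b = 0, plus one
# sub-probability measure (total mass 0.5) carrying a constant b = 0.4.
theta = [
    ({"rain": 0.3, "dry": 0.7}, 0.0),
    ({"rain": 0.6, "dry": 0.4}, 0.0),
    ({"rain": 0.2, "dry": 0.3}, 0.4),
]

# Invented policies, each summarized by its utility on every outcome.
policies = {
    "umbrella": {"rain": 0.8, "dry": 0.6},
    "no_umbrella": {"rain": 0.0, "dry": 1.0},
}

values = {name: infra_value(u, theta) for name, u in policies.items()}
best = max(values, key=values.get)
print(values, best)
```

When every \(\mu\) is a probability measure and every \(b\) is zero, this collapses to the maximin over a set of distributions from section 1; the third entry shows how a measure with missing mass can still be competitive once its constant is added.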

Updating is the natural generalization — drop measures inconsistent with observation, renormalize the surviving set — and it admits a dynamic-consistency theorem analogous to the Bayesian one, so the agent’s worst-case promises stay coherent over time.

This construction buys two things at once. Non-realizability: the true environment need not be in any single \(\mu\); it only needs to be dominated by some worst case in the set, which is a strictly weaker assumption and the direct technical response to the big-world problem above. Infra-Bayesian RL has regret bounds against non-realizable environments, which the classical Bayesian RL literature lacks. Embeddedness: by letting the agent’s beliefs about its own future actions be imprecise rather than committed to a particular distribution, the construction sidesteps some of the diagonalization traps (5-and-10, spurious counterfactuals) that the classical embedded agency literature wrestles with.

What it gives us: the cleanest existing technical answer to non-realizability, and a candidate decision theory for embedded agents that does not route through Löbian fixed-point gymnastics.

What it does not seem to give us: anything about bounded compute. The minimax over \(\Theta\) is taken as a primitive, with no story about who runs it or how long it takes.

The work mostly lives on the Alignment Forum, in Kosoy’s Learning-Theoretic Agenda writeups, and at stringent learning-theory conferences outside my usual remit.

4 References

Alquier. 2024. “User-Friendly Introduction to PAC-Bayes Bounds.” Foundations and Trends in Machine Learning.
Bissiri, Holmes, and Walker. 2016. “A General Framework for Updating Belief Distributions.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Camerer and Weber. 1992. “Recent Developments in Modeling Preferences: Uncertainty and Ambiguity.” Journal of Risk and Uncertainty.
Catoni. 2007. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning.
Clyde and Iversen. 2013. “Bayesian Model Averaging in the M-Open Framework.” In Bayesian Theory and Applications.
Cozman. 2000. “Credal Networks.” Artificial Intelligence.
De Bock. 2020. “Archimedean Choice Functions.” Information Processing and Management of Uncertainty in Knowledge-Based Systems.
Giustinelli, Manski, and Molinari. 2021. “Precise or Imprecise Probabilities? Evidence from Survey Response Related to Late-Onset Dementia.” Journal of the European Economic Association.
Haddouche and Guedj. 2022. “Online PAC-Bayes Learning.”
Hayashi. 2021. “Collective Decision Under Ignorance.” Social Choice and Welfare.
Jansen. 2013. “Robust Bayesian Inference Under Model Misspecification.”
Kelter. 2021. “Bayesian Model Selection in the M-Open Setting — Approximate Posterior Inference and Subsampling for Efficient Large-Scale Leave-One-Out Cross-Validation via the Difference Estimator.” Journal of Mathematical Psychology.
Kosoy. 2021. “Infra-Bayesian Physicalism: A Formal Theory of Naturalized Induction.”
Kosoy and Appel. 2020. “Infra-Bayesianism Sequence.”
Le and Clarke. 2017. “A Bayes Interpretation of Stacking for M-Complete and M-Open Settings.” Bayesian Analysis.
Masegosa. 2020. “Learning Under Model Misspecification: Applications to Variational and Ensemble Methods.” In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20.
McAllester. 1998. “Some PAC-Bayesian Theorems.” In Proceedings of the Eleventh Annual Conference on Computational Learning Theory. COLT ’98.
———. 1999. “PAC-Bayesian Model Averaging.” In Proceedings of the Twelfth Annual Conference on Computational Learning Theory.
Rivasplata, Kuzborskij, Szepesvari, et al. 2020. “PAC-Bayes Analysis Beyond the Usual Bounds.” In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20.
Rodríguez-Gálvez, Thobaben, and Skoglund. 2024. “More PAC-Bayes Bounds: From Bounded Losses, to Losses with General Tail Behaviors, to Anytime Validity.” Journal of Machine Learning Research.
Shirvaikar, Walker, and Holmes. 2024. “A General Framework for Probabilistic Model Uncertainty.”
Sucker and Ochs. 2023. “PAC-Bayesian Learning of Optimization Algorithms.” In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics.
Thiemann, Igel, Wintenberger, et al. 2017. “A Strongly Quasiconvex PAC-Bayesian Bound.” In Proceedings of the 28th International Conference on Algorithmic Learning Theory.
Walley. 1991. Statistical Reasoning with Imprecise Probabilities.