The Predictive Approach to Bayesian Inference
Purely-predictive models, “The Italian school”, martingale posteriors, …
2025-07-10 — 2025-08-04
Bayesian inference is traditionally introduced via unknown parameters: we place a prior on a parameter \(\theta\) and update it to a posterior given data, then use the posterior over this parameter to generate a posterior over actual observables. The predictive approach puts prediction at the center – probabilities are assigned directly to future observables rather than to parameters. This view was championed by Bruno de Finetti, who argued that probability statements should only refer to observable events, with parameters serving as a convenient fiction linking past data to future outcomes. In other words, the only meaningful uncertainty is about things we might observe. Every parameter is a nuisance parameter, an auxiliary object.
We might care about this minimalism for philosophical or methodological reasons, or we might care because the great success of the age in machine learning (e.g. in neural nets) has been predicting observables rather than trying to tie those predictions to “true parameters”. So maybe we need to think about Bayes in that context too?
Indeed, after watching a seminar by Susan Wei, I feel that this might be an interesting way of understanding when foundation models do optimal inference; most neural networks are best understood as purely predictive models rather than parameter estimators, so purely predictive Bayes suddenly seems like it might be a useful analogy.
1 Background and Notation
Throughout we need to distinguish distributions (uppercase letters) from their densities (lowercase letters) where both exist. We work on an infinite sequence of observations
\[ X_1,\,X_2,\,\dots \]
taking values in some space \(\mathcal X\). A finite batch of data is denoted
\[ x_{1:n} \;=\; (x_1,\dots,x_n). \]
1.1 Classic, Parameter-Based, Bayesian Inference
Parameter: \(\theta\in\Theta\).
Prior distribution on \(\theta\): \(\Pi(d\theta)\), with density \(\pi(\theta)\).
Likelihood of data \(x_{1:n}\) under parameter \(\theta\):
\[ L(x_{1:n}\mid \theta) \quad\bigl(\text{often written }p(x_{1:n}\mid \theta)\bigr). \]
Posterior distribution of \(\theta\):
\[ \Pi(d\theta\mid x_{1:n}), \quad\text{with density } \pi(\theta\mid x_{1:n}) = \frac{L(x_{1:n}\mid \theta)\,\pi(\theta)}{\int_\Theta L(x_{1:n}\mid \vartheta)\,\pi(\vartheta)\,d\vartheta}. \]
Posterior predictive for a new observation \(X_{n+1}\):
\[ P_{\text{param}}\bigl(X_{n+1}\in A\mid x_{1:n}\bigr) = \int_\Theta P_\theta(X_{n+1}\in A)\;\Pi(d\theta\mid x_{1:n}), \]
or in density form
\[ p(x_{n+1}\mid x_{1:n}) = \int_\Theta p(x_{n+1}\mid \theta)\,\pi(\theta\mid x_{1:n})\,d\theta. \]
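To make the parametric route concrete, here is a minimal sketch (in Python, with function names of my own choosing) of the Beta–Bernoulli case: the posterior predictive can be obtained either by numerically integrating \(p(x_{n+1}\mid\theta)\) against the posterior density, or in closed form as the posterior mean \((a+k)/(a+b+n)\).

```python
import numpy as np
from scipy import integrate, stats

def posterior_predictive_heads(x, a=1.0, b=1.0):
    """P(X_{n+1} = 1 | x_{1:n}) under a Beta(a, b)-Bernoulli model, two ways."""
    n, k = len(x), int(np.sum(x))
    # Closed form: the posterior is Beta(a + k, b + n - k); the predictive
    # probability of heads is its mean.
    closed = (a + k) / (a + b + n)
    # Same thing by numerically integrating p(x_{n+1} = 1 | theta) = theta
    # against the posterior density.
    post = stats.beta(a + k, b + n - k)
    numeric, _ = integrate.quad(lambda th: th * post.pdf(th), 0.0, 1.0)
    return closed, numeric

x = np.array([1, 0, 1, 1, 0, 1])          # toy data: 4 heads in 6 tosses
print(posterior_predictive_heads(x))      # both ~ (1 + 4) / (2 + 6) = 0.625
```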
2 Predictive (de Finetti–Style) Inference
Rather than introduce an unobserved \(\theta\), we work directly with the sequence of one-step-ahead predictive distributions:
\[ P_n(\,\cdot\mid x_{1:n}) \equiv \Pr\bigl\{X_{n+1}\in\cdot \,\bigm|\,X_{1:n}=x_{1:n}\bigr\}, \qquad n=0,1,2,\dots, \]
where:
- \(P_0\) is the prior predictive (our belief about \(X_1\) before seeing data).
- \(P_n\) is the predictive rule after seeing \(x_{1:n}\).
If these \(P_n\) satisfy coherence and exchangeability conditions (resp. Kolmogorov consistency, symmetry in \(x_{1:n}\)), then by de Finetti’s theorem there exists an (implicit) random measure \(F\) so that
\[ P_n(A\mid x_{1:n}) = \mathbb{E}\bigl[F(A)\mid x_{1:n}\bigr]. \]
The \(P_n\) are our Bayesian updates, but in principle we do not need to write down—or sample—the posterior on \(F\), because there is an existence proof that such updates exist. Confusingly, many intros assume this will be sufficient to satisfy us! And yet, I am not satisfied. How do I calculate such things?
2.1 Key notation for predictives
| Symbol | Meaning |
|---|---|
| \(P_n(\cdot\mid x_{1:n})\) | Predictive distribution for \(X_{n+1}\) given data \(x_{1:n}\). |
| \(p_n(x_{n+1}\mid x_{1:n})\) | Density of \(P_n\) when it exists (lowercase for densities). |
| \(\tilde F\) | Implicit random probability measure (“parameter” in de Finetti’s sense). |
| \(\Pi(dF)\) | Mixing measure on \(\tilde F\) (the de Finetti prior), usually never written out. |
| \(\alpha,F_0\) | Hyperparameters in Dirichlet-process examples: \(\alpha\) is concentration, \(F_0\) the base. |
- The parameter view is familiar to more people: we write down \(\pi(\theta)\) and \(p(x\mid\theta)\), update to \(\pi(\theta\mid x)\), then integrate out \(\theta\) to predict.
- The predictive view flips that: we directly specify how we would predict the next datum at each step. If our predictive rule is coherent, there automatically exists some Bayes model behind it.
For many modern nonparametric and robust methods, specifying the \(P_n\) directly can be easier than choosing a high-dimensional prior. In some cases (e.g. Fortini–Petrone’s predictive resampling or Fong et al.’s martingale posteriors) we never explicitly deal with \(\tilde F\) or \(\theta\) at all.
3 Practical implementation
You will notice that most of the theory below consists of elaborate and quite painful proofs of symmetry properties for exponential-family models – interesting, but not exactly state-of-the-art world models. You might ask: can we actually compute predictive Bayes in practice? Would this be helpful for building intuition?
Yes, it turns out there are a couple of useful models. In particular, Bayesian in-context learning via transformers etc. seems to be a pretty good model for a pure Bayes predictive. Hollmann et al. (2023), a tabular-data inference method, and Lee et al. (2023), a neural-process regressor, both make this connection explicit and are great to play with. I find this massively helpful in building intuitions, as opposed to, say, starting by proving theorems about factorisations of measures under exchangeability.
4 How it works
4.1 Exchangeability and de Finetti’s Theorem
Exchangeability is the assumption that our probabilistic beliefs about a sequence of random variables do not depend on the order in which data are observed. Formally, \((X_1,\ldots,X_n)\) is exchangeable if for every permutation \(\sigma\) of \(\{1,\dots,n\}\), the joint distribution satisfies \(P(X_1\in dx_1,\dots,X_n\in dx_n) = P(X_{\sigma(1)}\in dx_1,\dots,X_{\sigma(n)}\in dx_n)\). De Finetti’s representation theorem is the foundation: any infinite exchangeable sequence is a conditionally iid sequence. There exists a (generally unknown) random distribution \(F\) such that, given \(F\), the \(X_i\) are iid with law \(F\). In Bayesian terms, \(F\) plays the role of a parameter – in fact we can think of \(F\) as \(\theta\) and put a prior on it. In measure-theoretic form: if \((X_n)_{n\ge1}\) is exchangeable on a space \(\mathcal{X}\), then there exists a probability measure \(\mu\) on the set of probability distributions over \(\mathcal{X}\) such that for every \(n\) and any measurable events \(A_1,\dots,A_n\subseteq\mathcal{X}\),
\[ P(X_1\in A_1,\dots,X_n\in A_n) \;=\; \int \prod_{i=1}^n F(A_i)\; \mu(dF)\,. \]
\(\mu\) is the de Finetti mixing measure, or the law of the random \(F\). If \(\mu\) were known, we could sample \(F\sim \mu\) (this is like drawing a random parameter) and then sample \(X_1,\dots,X_n \overset{\text{iid}}{\sim} F\). Of course \(\mu\) is typically not known – it is merely asserted to exist by the theorem.
The joint distribution factors into a product of predictive terms. As noted earlier, \(P(x_1,\dots,x_n) = \int \prod_{i=1}^n F(dx_i)\,\mu(dF)\). By Fubini’s theorem, we can swap integration order to see this as an iterative conditioning: first draw \(F\sim\mu\), then \(X_1\sim F\); next draw \(X_2\sim F\), etc. The result is exactly \(P(x_1)P(x_2|x_1)\cdots P(x_n|x_{1:n-1})\). In fact, the one-step predictive distribution emerges naturally: \(P(X_{n+1}\in A \mid X_{1:n}=x_{1:n}) \;=\; \int F(A)\; \mu(dF\mid X_{1:n}=x_{1:n})\,,\) where \(\mu(\cdot\mid X_{1:n}=x_{1:n})\) is the posterior distribution of \(F\) given the observed data. This formula formalizes a simple idea: the predictive distribution for a new observation is the posterior mean of \(F\) (the “urn” distribution). For example, in a coin-flip scenario, \(F\) would be the true (but unknown) distribution of Heads/Tails; given some data, the predictive probability of Heads on the next flip is the posterior expected value of \(F(\{\text{Heads}\})\). If we had a \(\operatorname{Beta}(a,b)\) prior on the coin’s bias \(p=F(\{\text{Heads}\})\), this predictive is \(\frac{a + \text{(heads so far)}}{a+b+n}\) – the well-known Beta-Binomial rule. Taking \(a=b=1\) (a uniform prior), this gives \((1+k)/(2+n)\) after \(k\) heads in \(n\) tosses, which is Laplace’s “rule of succession.” Notice the predictive viewpoint reproduces such formulas in this case: it is literally the posterior expectation of the unknown frequency.
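Here is a small numerical check of that factorisation, under the Beta–Bernoulli assumptions of the coin example (function names are mine): the joint probability computed as the de Finetti mixture integral agrees with the product of one-step Laplace-style predictives.

```python
import numpy as np
from scipy import integrate
from scipy.stats import beta

def joint_via_mixture(x, a=1.0, b=1.0):
    """P(x_{1:n}) as the de Finetti mixture: integral of p^k (1-p)^{n-k} against Beta(a,b)."""
    k, n = int(np.sum(x)), len(x)
    val, _ = integrate.quad(
        lambda p: p**k * (1 - p)**(n - k) * beta(a, b).pdf(p), 0.0, 1.0)
    return val

def joint_via_predictives(x, a=1.0, b=1.0):
    """P(x_{1:n}) as the product of one-step predictives P(x_i | x_{1:i-1})."""
    prob, heads, n = 1.0, 0, 0
    for xi in x:
        p_head = (a + heads) / (a + b + n)   # Laplace-style predictive for 'heads'
        prob *= p_head if xi == 1 else (1 - p_head)
        heads += xi
        n += 1
    return prob

x = [1, 0, 1, 1, 0]
print(joint_via_mixture(x), joint_via_predictives(x))   # both ~ 1/60
```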
Learning is reflected in the convergence of predictive distributions. Under exchangeability, the Strong Law of Large Numbers implies the empirical distribution of observations converges to the underlying \(F\) almost surely. In predictive terms, this means our predictive distribution eventually concentrates. A precise statement is: with probability 1, \(P(X_{n+1}\in \cdot \mid X_{1:n})\) converges (as \(n\to\infty\)) to the limiting empirical distribution \(F\); equivalently, the posterior over \(F\) becomes degenerate, putting all its mass on that limit. In other words, as we see more data, the sequence of predictive measures \((P_n)\) forms a martingale that converges to the true distribution \(F\) almost surely. This is a powerful (main?) consistency property: no matter what the true \(F\) is, an exchangeable Bayesian will (almost surely) eventually have predictive probabilities that match \(F\) on every set \(A\). In classical terms, the posterior on \(F\) converges to a point mass at the truth (if the model is well-specified). This was first shown by Joseph Doob in 1949 using martingale theory. It provides a frequentist validation of Bayesian learning in the exchangeable case. From the predictive perspective, it says that if our predictive rule is coherent and we eventually see enough data, we will effectively discover the data-generating distribution – our forecasts become indistinguishable from frequencies in the long run.
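A tiny simulation of that convergence, using the Laplace rule-of-succession predictive from the coin example (the true bias 0.7 is my arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
p_true = 0.7
x = rng.binomial(1, p_true, size=5000)   # 'true' coin, unknown to the learner

heads = np.cumsum(x)
n = np.arange(1, len(x) + 1)
pred = (1 + heads) / (2 + n)             # Laplace rule-of-succession predictive

# The predictive probability of heads converges to the true frequency.
for m in (10, 100, 1000, 5000):
    print(m, pred[m - 1])
```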
As a time-series guy, one thing I found weird about this literature, which I will flag for you, dear reader, is that essentially everything here looks like it should extend to predicting arbitrary phenomena, but different schools introduce different assumptions on the generating process, and it is not easy to back out which, at least not for me. Sometimes we are handling complicated dependencies, other times not. AFAICT ‘predictive Bayes’ as such doesn’t impose much, but de Finetti’s actual results about exchangeability essentially impose conditionally-i.i.d. structure on the observations.
4.2 Predictive Distributions and Coherence Conditions
A predictive rule or predictive distribution sequence is a collection \(\{P_n(\cdot \mid x_{1:n}) : n\ge0,\; x_{1:n}\in \mathcal{X}^n\}\) where each \(P_n(\cdot\mid x_{1:n})\) is a probability measure for the next observation \(X_{n+1}\) given past observations \(x_{1:n}\). (For \(n=0\), we have \(P_0(\cdot)\), which is just the distribution of \(X_1\) before seeing any data.) Not every arbitrary choice of predictive rules corresponds to a valid joint distribution — they must satisfy coherence constraints. The fundamental coherence condition is that there exists some joint probability law \(P\) on the sequence such that for all \(n\) and events \(A\subseteq\mathcal{X}\):
\[ P_n(A \mid X_{1:n}=x_{1:n}) \;=\; P(X_{n+1}\in A \mid X_{1:n}=x_{1:n}), \]
i.e. the kernel \(P_n\) truly comes from that joint law. By the Ionescu–Tulcea extension theorem, if we can specify a family of conditional probabilities in a consistent way for every finite stage, a unique process measure \(P\) on the sequence (on \(\mathcal{X}^\infty\)) is induced. Thus, one route to define a Bayesian model is: pick a predictive rule and invoke Ionescu–Tulcea to get the joint. However, ensuring consistency and exchangeability in the predictive rule is weird; we have a bunch of existence proofs, but they are not obviously constructive. Fortini, Ladelli, and Regazzini (2000) gave necessary and sufficient conditions for exchangeability in terms of predictives. In plain language, their conditions are:
Symmetry: If we condition on any past data \(x_{1:n}\), the predictive distribution \(P_n(\cdot \mid x_{1:n})\) must be a symmetric function of \(x_{1:n}\). That is, it depends on the past observations only through their multiset or empirical frequency. This is intuitively clear – if the order of past data doesn’t matter for joint probabilities, it shouldn’t matter for predicting the next observation either. So, for example, in an exchangeable coin toss model, \(P(X_{n+1}=\text{Heads}\mid X_{1:n})\) can only depend on the number of heads in \(n\) tosses (and \(n\)), not on the exact sequence.
Associativity / Consistency: This one is more technical. Roughly it means the predictive rule must be internally consistent when extended to two steps. Formally, for any \(n\ge0\), for any events \(A,B\) in the state space, we require
\[ \int_A P_{n+1}(B \mid x_{1:n}, x_{n+1}) \; P_n(dx_{n+1} \mid x_{1:n}) \;=\; \int_B P_{n+1}(A \mid x_{1:n}, x_{n+1}) \; P_n(dx_{n+1} \mid x_{1:n})\,, \]
and this should hold for all past \(x_{1:n}\). Although this expression looks complicated, it is basically saying: if we consider two future time points \(n+1\) and \(n+2\), the joint predictive for \((X_{n+1}, X_{n+2})\) should be symmetric in those two (because the entire sequence is exchangeable). It ensures that our one-step predictive extends consistently to a two-step (and hence multi-step) predictive. In intuitive terms, condition (ii) is related to Kolmogorov consistency for the projective family of predictive distributions and the requirement of exchangeability on future samples. If symmetry (i) holds and this consistency (ii) holds, then there exists an exchangeable joint law producing those predictives. These are the predictive analog of de Finetti’s theorem: they characterize when a given predictive specification is “valid” (arises from some mixture model).
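As a sanity check on what these conditions buy us, here is a toy numerical verification (my own construction, with an arbitrarily chosen discrete base measure and concentration) that a Pólya-urn-style predictive rule yields order-invariant joint probabilities when the joint is built as a product of one-step predictives.

```python
from itertools import permutations

ALPHA = 2.0                      # arbitrary concentration (assumption)
F0 = {0: 0.7, 1: 0.3}            # arbitrary discrete base measure (assumption)

def predictive(v, past):
    """Polya-urn-style rule: (alpha * F0(v) + #{past == v}) / (alpha + n)."""
    return (ALPHA * F0[v] + sum(1 for z in past if z == v)) / (ALPHA + len(past))

def joint(seq):
    """Joint probability of a sequence as the product of one-step predictives."""
    prob = 1.0
    for i, v in enumerate(seq):
        prob *= predictive(v, seq[:i])
    return prob

seq = (1, 0, 0, 1, 1)
# Every reordering gives the same joint probability: the rule is exchangeable.
print({round(joint(p), 12) for p in permutations(seq)})
```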
The takeaway is that to design a Bayesian model in predictive form, we must propose a rule \(P_n(\cdot\mid x_{1:n})\) that is symmetric and coherent across time. If we do have such a rule, either by being classic parametric, or by some spicy extension, a de Finetti-type parameter \(\Theta\) (the random distribution \(F\)) is implicitly defined as the thing whose posterior predictive matches our rule.
Note, we get that for free by just doing classic-flavoured parametric Bayes and assuming that our likelihood is correct. The extra work here is to make that more general.
The predictive approach often yields an implicit description of the prior on \(\Theta\) without having to write it down explicitly; the Pólya urn scheme discussed later is the classic example.
4.3 Parametric Models and Sufficient Statistics
The predictive approach is fully compatible with parametric Bayesian models as well. If we have a parametric family \(\{f_\theta(x)\}\) and a prior \(\pi(d\theta)\), the traditional posterior predictive is \(P(X_{n+1}\in A \mid X_{1:n}=x_{1:n}) = \int_{\Theta} P_\theta(X_{n+1}\in A)\; \pi(d\theta \mid X_{1:n}=x_{1:n})\,.\) This will obey the coherence conditions automatically (since it came from a joint model). But an interesting viewpoint is gained by focusing on predictive sufficiency. In many models, the posterior predictive depends on the data only through some summary \(T(x_{1:n})\). For example, in the Beta–Binomial model (coin flips with Beta prior), \(T\) can be the number of heads \(k\) in \(n\) flips. In a normal model with known variance and unknown mean, \(T\) could be the sample mean. In general, if \(T(x_{1:n})\) is a sufficient statistic for \(\theta\), then the predictive distribution \(P_n(\cdot\mid x_{1:n})\) will be a function of \(T(x_{1:n})\). Fortini, Ladelli, and Regazzini (2000) discuss how predictive sufficiency relates to classical sufficiency. Essentially, a statistic \(T_n = T(X_{1:n})\) is predictively sufficient if \(P_n(X_{n+1}\in \cdot \mid X_{1:n}) = P_n(X_{n+1}\in \cdot \mid T_n)\) for all data – that is, the prediction for a new observation depends on past data only through \(T_n\). For an exponential family with conjugate prior, this \(T_n\) will usually be the same as the conventional sufficient statistic. Predictive sufficiency provides another angle: instead of deriving the posterior for \(\theta\), we can directly derive the predictive by updating \(T\). For example, in the normal-with-known-variance case, we can derive the Gaussian form of the predictive for the next observation by directly updating the mean and uncertainty of the sampling distribution, rather than explicitly updating a prior for \(\theta\). In summary, parametric Bayesian models yield predictive rules of a special form (ones that admit a finite-dimensional sufficient statistic). These are consistent with the general theory but represent cases where the predictive distributions lie on a lower-dimensional family. The predictive approach doesn’t conflict with parametric modelling – it generalizes it, and in practice it often yields the same results (since Bayes’ theorem ensures coherence). What it offers is a different interpretation of those results: our posterior is just a device to produce predictions, and we could have potentially produced those predictions without ever introducing \(\theta\) explicitly.
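A minimal sketch of predictive sufficiency in the normal-known-variance case, assuming a conjugate \(N(m_0, s_0^2)\) prior on the mean (the constants and names are mine): the posterior predictive is computed entirely from \((\bar x, n)\), never from the raw data.

```python
import numpy as np

def predictive_from_summary(xbar, n, sigma=1.0, m0=0.0, s0=10.0):
    """Posterior predictive N(m_n, s_n^2 + sigma^2) for X ~ N(theta, sigma^2)
    with a N(m0, s0^2) prior on theta; needs the data only through (xbar, n)."""
    post_prec = 1.0 / s0**2 + n / sigma**2          # posterior precision of theta
    s_n2 = 1.0 / post_prec
    m_n = s_n2 * (m0 / s0**2 + n * xbar / sigma**2)
    return m_n, np.sqrt(s_n2 + sigma**2)            # predictive mean and sd

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=50)
# Any reordering or re-batching of x gives the same answer: (xbar, n) is
# predictively sufficient here.
print(predictive_from_summary(x.mean(), len(x)))
```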
4.4 Martingales and the Martingale Posterior
One of the modern breakthroughs in predictive inference is recognizing the role of martingales in Bayesian updating. We saw earlier that the sequence of predictive distributions \((P_n)\) under an exchangeable model is a martingale (in the weak sense that \(P_{n}\) is the conditional expectation of \(P_{n+1}\) given current information) and it converges to the true distribution \(F\) almost surely. E. Fong, Holmes, and Walker (2021) took this insight further: if the conventional Bayes posterior predictive yields a martingale that converges to a degenerate distribution at the true \(F\), perhaps we can construct other martingales that converge to something else, representing different uncertainty quantification. The martingale posterior is built on this idea. They start from the premise: “the foundation of Bayesian inference is to assign a distribution on missing observations conditional on what has been observed.” Rather than begin with a prior on \(\theta\), they ask: what distribution would a Bayesian assign to all the not-yet-seen data \(Y_{n+1},Y_{n+2},\dots\) given the seen \(Y_{1:n}\)? If we follow standard Bayes, the answer is clear: it would be the product of posterior predictives, i.e. \(P(Y_{n+1:\infty}\in dy_{n+1:\infty}\mid Y_{1:n}=y_{1:n}) = \int \prod_{m=n+1}^\infty F(dy_m)\; \mu(dF \mid Y_{1:n}=y_{1:n})\). This complicated object is basically “\(F\) drawn from the posterior, then future \(Y\) iid from \(F\).” From it, we can derive the usual posterior for any parameter \(\theta = T(F)\) as the induced distribution of \(\theta\). In fact, Doob’s theorem showed that if we go this route, the distribution of \(\theta\) defined in this way equals the standard posterior for \(\theta\). So nothing new so far – it just recapitulates that the Bayesian posterior is equivalent to the Bayesian predictive for the sequence. But now comes the twist: we are free to choose a different joint predictive for the future that is not derived from the prior predictive formula. If we do so, we get a different “posterior” for \(\theta\) when we invert that relationship. The only requirement is that the chosen joint predictive for \(Y_{n+1:\infty}\) given \(Y_{1:n}\) satisfies a martingale condition as \(n\) grows (so that as \(n\to\infty\) it collapses appropriately). Fong et al. introduce a family of such predictive rules (including some based on copula dependence structures to handle when observations are not iid). The resulting martingale posterior is the probability distribution on the parameter (or on any estimand of interest) that corresponds to these new predictives. It is a true posterior in the sense that it is an updated distribution on the quantity given the data, but it need not come from any prior via Bayes’ rule. In practice, we can think of it as a Bayesian-like posterior that uses an alternative updating rule. The martingale posterior is chosen to satisfy certain optimality or robustness properties (for instance, we can design it to minimize some loss on predictions, or to coincide with bootstrap resampling in large samples).
Fong et al. note connections between martingale posteriors and the Bayesian bootstrap and other resampling schemes. The term “martingale” highlights that the sequence of these posteriors (as \(n\) increases) is a martingale in the space of probability measures, which ensures coherence over time. This concept is quite new, and ongoing research is examining its properties – for example, how close is a martingale posterior to the ordinary posterior as \(n\) becomes large? Early results indicate that if the model is well-specified, certain martingale posteriors are consistent (converge on the true parameter) and even asymptotically normal like a standard posterior. But if the model is misspecified, a cleverly chosen martingale posterior might offer advantages in terms of robustness (since it was not derived from the wrong likelihood). The theory here is deep, combining Bayesian ideas with martingale convergence and requiring new technical tools to verify things like convergence in distribution of these posterior-like objects. From a user’s perspective, however, the appeal is clear: we can get a posterior without specifying a prior or likelihood, by directly coding how we think future data should behave. This significantly lowers the barrier to doing Bayesian-style uncertainty quantification in complex problems – we can bypass the often difficult task of choosing a prior or fully specifying a likelihood and focus on predictions (which may be easier to elicit from experts or justify scientifically).
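For intuition, here is a crude predictive-resampling sketch in the spirit of these ideas (my own simplification, not Fong et al.’s actual algorithms): it uses the \(\alpha\to 0\) Pólya-urn (empirical) predictive, forward-simulates a long pseudo-future, and records the mean of each completed sequence. The resulting draws behave much like a Bayesian bootstrap posterior for the mean.

```python
import numpy as np

def martingale_posterior_mean(x, n_future=2000, n_draws=500, seed=0):
    """Predictive resampling with the alpha -> 0 Polya-urn (empirical) predictive:
    forward-simulate a long pseudo-future, then record the mean of each completed
    sequence. The draws approximate an uncertainty distribution for the mean."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        pool = list(x)                                   # observed + imputed so far
        for _ in range(n_future):
            pool.append(pool[rng.integers(len(pool))])   # next obs ~ empirical predictive
        draws.append(np.mean(pool))
    return np.array(draws)

x = np.random.default_rng(1).normal(0.5, 1.0, size=30)
samples = martingale_posterior_mean(x)
print(samples.mean(), np.quantile(samples, [0.025, 0.975]))
```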
5 Summary of Theoretical Insights
The predictive framework rests on a few key insights proven by the above results:
Observables suffice: If we can specify our beliefs about observable sequences (one-step-ahead at a time) in a consistent way, we have done the essential job of modelling. Theorems like de Finetti’s and its predictive extensions guarantee that a parameter-based description exists if we want it, but it’s optional. “Bayes à la de Finetti” means one-step probabilities are the building blocks.
Exchangeability is a powerful symmetry: It grants a form of “sufficientness” to empirical frequencies. For instance, in an exchangeable model, the predictive distribution for tomorrow given all past data depends only on the distribution of past data, not on their order. This leads to natural Bayesian consistency (learning from frequencies) and justifies why we often reduce data to summary statistics.
Predictive characterization of models: Many complex models (like BNP priors or hierarchical mixture models) can be characterized by their predictive rules. In some cases this yields simpler derivations. For example, it’s easier to verify an urn scheme than to prove a Chinese restaurant process formula from scratch. Predictive characterizations also allow extending models by modifying predictive rules (e.g. creating new priors via new urn schemes, such as adding reinforcement or memory).
Philosophical clarity: The predictive view clarifies what the “parameter” really is – usually some aspect of an imaginary infinite population. As Fong et al. put it, “the parameter of interest [is] known precisely given the entire population”. This demystifies \(\theta\): it’s not a mystical quantity but simply a function of all unseen data. Thus, debating whether \(\theta\) “exists” is moot – what exists are data (observed or not yet observed). This philosophy can be very practical: it encourages us to check our models by simulating future data (since the model is the prediction rule), and to judge success by calibration of predictions.
Flexibility and robustness: By freeing ourselves from always specifying a likelihood, we can create “posterior-like” updates that may be more robust. For instance, we could specify heavier-tailed predictive densities than a Gaussian model would give, to reduce sensitivity to outliers, and still have a coherent updating scheme that quantifies uncertainty. This is one motivation behind general Bayesian approaches and martingale posteriors.
6 Practical Methods and Examples
Let’s walk through a few concrete examples and methods where the predictive approach is applied. We’ve already discussed the Pólya urn (Dirichlet process) and a simple Beta-Bernoulli model. Here we highlight additional applied scenarios:
In-context learning: as we presaged above, there are neural networks that give up on the proofs and just compute interesting Bayes predictive updates. See (Hollmann et al. 2023; Lee et al. 2023).
Bayesian Bootstrap (Rubin 1981): Suppose we have observed data \(x_1,\dots,x_n\) which we treat as a sample from some population distribution \(F\). The Bayesian bootstrap avoids choosing a parametric likelihood for \(F\) and instead puts a uniform prior on the space of all discrete distributions supported on \(\{x_1,\dots,x_n\}\). In effect, we assume exchangeability and that the true \(F\) puts all its mass on the observed points (as would be the case if these \(n\) points were literally the entire population values, but we just don’t know their weights). The posterior for \(F\) given the data then turns out to be a Dirichlet(\(1,1,\dots,1\)) on the point masses at \(x_1,\dots,x_n\). Consequently, the posterior predictive for a new observation \(X_{n+1}\) is \(P(X_{n+1}=x_i \mid X_{1:n}=x_{1:n}) = \frac{1}{n}\,,\) for each \(i=1,\dots,n\). In other words, the next observation is equally likely to be any of the observed values. This is precisely the Bayesian bootstrap’s predictive distribution, which is just the empirical distribution of the sample (sometimes one extra point is allowed to be new, but under the flat prior a new point gets zero posterior mass). The Bayesian bootstrap can be viewed as a limiting case of the Dirichlet process predictive rule when \(\alpha \to 0\) (or, dually, when I say “I believe all probability mass is already in the observed points”). It’s a prime example of how a predictive assumption (the next point is equally likely to be any observed point) leads to an implicit prior (Dirichlet(1,…,1) on weights) and a posterior (the random weights after a Dirichlet draw). The Bayesian bootstrap is often used to generate approximate posterior samples for parameters like the mean or quantiles without having to assume any specific data distribution. This method has gained popularity in Bayesian data analysis, especially in cases where a parametric model is hard to justify. It is an embodiment of de Finetti’s idea: we directly express belief about future draws (they should resemble past draws in distribution) and that is our model.
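A minimal implementation sketch of the Bayesian bootstrap for the mean (names and toy data are mine): draw Dirichlet\((1,\dots,1)\) weights over the observed points and evaluate the weighted statistic under each draw.

```python
import numpy as np

def bayesian_bootstrap_mean(x, n_draws=4000, seed=0):
    """Bayesian bootstrap for the mean: draw Dirichlet(1, ..., 1) weights over
    the observed points and evaluate the weighted mean under each draw."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    w = rng.dirichlet(np.ones(len(x)), size=n_draws)    # posterior weights on the x_i
    return w @ x                                        # one weighted mean per draw

x = np.random.default_rng(2).exponential(scale=2.0, size=40)
draws = bayesian_bootstrap_mean(x)
print(draws.mean(), np.quantile(draws, [0.025, 0.975]))  # 'posterior' for the mean
```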
Predictive Model Checking and Cross-Validation: In Bayesian model evaluation, a common practice is to use the posterior predictive distribution to check goodness-of-fit: simulate replicated data \(X_i^{\text{rep}}\) from \(P(X_i^{\text{rep}}\mid X_{1:n})\) and compare to observed \(X_i\). Any systematic difference may indicate model misfit. This is fundamentally a predictive approach: rather than testing hypotheses about parameters, we ask “does the model predict new data that look like the data we have?”. It aligns perfectly with the predictive view that the ultimate goal of inference is accurate prediction. In fact, modern Bayesian workflow encourages predictive checks at every step. Additionally, methods like leave-one-out cross-validation (LOO-CV) can be given a Bayesian justification via the predictive approach. The LOO-CV score is essentially the product of \(P(X_i \mid X_{-i})\) over all \(i\) (the probability of each left-out point under the predictive based on the rest). Selecting models by maximizing this score (or its logarithm) is equivalent to maximizing predictive fit. Some recent research (including by Fong and others) formally connects cross-validation to Bayesian marginal likelihood and even proposes cumulative cross-validation as a way to score models coherently. The philosophy is: a model is good if it predicts well, not just if it has high posterior probability a priori. By building model assessment on predictive distributions, we ensure the evaluation criteria align with the end-use of the model (prediction or forecasting).
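As a toy illustration (my own example), the LOO-CV score for a Beta–Bernoulli model can be computed from the closed-form leave-one-out predictives:

```python
import numpy as np

def log_loo_score_bernoulli(x, a=1.0, b=1.0):
    """Sum of log P(x_i | x_{-i}) under a Beta(a, b)-Bernoulli model, using the
    closed-form leave-one-out predictive (a + k_{-i}) / (a + b + n - 1)."""
    x = np.asarray(x)
    n, k = len(x), int(x.sum())
    score = 0.0
    for xi in x:
        p_head = (a + k - xi) / (a + b + n - 1)   # predictive for the held-out point
        score += np.log(p_head if xi == 1 else 1.0 - p_head)
    return score

x = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
print(log_loo_score_bernoulli(x))   # higher (less negative) = better predictive fit
```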
Coresets and Large-Scale Bayesian Summarization: (Flores 2025) A very recent application of predictive thinking is in creating Bayesian coreset algorithms – these aim to compress large datasets into small weighted subsets that yield almost the same posterior inference. Traditionally, coreset construction tries to approximate the log-likelihood of the full data by a weighted log-likelihood of a subset (minimizing a KL divergence). However, this fails for complex non-iid models. Flores (2025) proposed to use a predictive coreset: choose a subset of points such that the posterior predictive distribution of the subset is close to that of the full data. In other words, rather than matching likelihoods, match how well the subset can predict new data like the full set would. This approach explicitly cites the predictive view of inference (E. Fong and Yiu 2024; Fortini and Petrone 2012) as inspiration. The result is an algorithm that works even for models where likelihoods are intractable (because it can operate on predictive draws). This is a cutting-edge example of methodological innovation driven by predictive thinking.
Machine Learning and Sequence Modelling: It’s worth noting that in machine learning, modern large models (like transformers) are often trained to do next-token prediction on sequences. In some recent conceptual work, researchers have drawn a connection between such pre-trained sequence models and de Finetti’s theory. Essentially, a large language model that’s been trained on tons of text is implicitly representing a predictive distribution for words given preceding words. If the data (text) were regarded as exchangeable in some blocks, the model is doing a kind of empirical Bayes (using the training corpus as prior experience) to predict new text. Some authors (Ye and Namkoong 2024) have even argued that in-context learning by these models is equivalent to Bayesian updating on latent features, “Bayesian inference à la de Finetti”. While these ideas are still speculative, they illustrate how the predictive perspective resonates in ML: the focus is entirely on \(P(\text{future tokens}\mid \text{past tokens})\). If we were to build an AI that learns like a Bayesian, it might well do so by honing its predictive distribution through experience, rather than by explicitly maintaining a distribution on parameters. This is essentially what these sequence models do, albeit not in a fully coherent probabilistic way. See (Hollmann et al. 2023; Lee et al. 2023) for concrete applications.
7 History
I got an LLM to summarize the history for me:
8 Historical Timeline of the Predictive Framework
1930s – de Finetti’s Foundation: In 1937, Bruno de Finetti published his famous representation theorem for exchangeable sequences, laying the cornerstone of predictive Bayesian inference. An infinite sequence of observations \(X_1,X_2,\dots\) is exchangeable if its joint probability is invariant under permutation of indices. De Finetti’s theorem states that any infinite exchangeable sequence is equivalent to iid sampling from some latent random probability distribution \(F\); for suitable observables (say \(X_i\in\mathbb{R}\)):
\[ P(X_1\in dx_1,\dots,X_n\in dx_n) \;=\; \int \prod_{i=1}^n F(dx_i)\; \mu(dF)\,, \]
where \(\mu\) is a “mixing” measure on distribution functions \(F\) (this \(\mu\) serves as a prior over \(F\) in Bayesian terms). Intuitively, if we believe the \(X_i\) are exchangeable, we act as if there is some unknown true distribution \(F\) governing them; given \(F\), the data are iid. De Finetti emphasized that \(F\) itself is an unobservable construct – what matters are the predictive probabilities \(P(X_{n+1}\in A \mid X_{1:n}=x_{1:n})\) for future observations. His philosophical stance was that probability is about our belief in future observable events, not about abstract parameters. He often illustrated this with betting and forecasting interpretations, effectively treating inference as an updating of predictive “previsions” (expectations of future quantities). De Finetti’s ideas formed the philosophical bedrock of the Italian school of subjective Bayesianism, shifting focus toward prediction.
1950s – Formalization of Exchangeability: Following de Finetti, mathematical statisticians solidified the theoretical underpinnings. Hewitt and Savage (1955) provided a rigorous existence proof for de Finetti’s representation via measure-theoretic extension theorems (ensuring a mixing measure \(\mu\) exists for any exchangeable law). This period established exchangeability as a fundamental concept in Bayesian theory. Simply put, exchangeability = “iid given some \(F\)”. This result, sometimes called de Finetti’s theorem, became a “cornerstone of modern Bayesian theory”. It means that specifying a prior on the parameter (or on \(F\)) is mathematically equivalent to specifying a predictive rule for the sequence. In fact, we can recover de Finetti’s mixture form by multiplying one-step-ahead predictive probabilities:
\[ P(x_1,\dots,x_n) = P(x_1) \, P(x_2\mid x_1)\cdots P(x_n\mid x_{1:n-1})\,, \]
and for an exchangeable model this product must equal the integral above. This insight – that a joint distribution can be factorized into sequential predictive distributions – is central to the predictive approach.
1970s – Bayesian Nonparametrics and Urn Schemes: Decades later, de Finetti’s predictive philosophy found new life in Bayesian nonparametric (BNP) methods. In 1973, Blackwell and MacQueen (1973) introduced the Pólya urn scheme as a constructive predictive rule for the Dirichlet Process (DP) prior, which Ferguson had proposed that same year as a nonparametric prior on distributions. Blackwell and MacQueen showed that if \(X_1\sim F_0\) (a base distribution) and for each \(n\ge1\), \(X_{n+1}\mid X_{1:n} \sim \frac{\alpha}{\alpha+n}F_0 + \frac{1}{\alpha+n}\sum_{i=1}^n \delta_{X_i}\,,\) then the sequence \((X_n)_{n\ge1}\) is exchangeable, and its de Finetti mixing measure is exactly the Dirichlet process with concentration \(\alpha\) and base \(F_0\). In this predictive rule, with probability \(\frac{\alpha}{\alpha+n}\) the \((n+1)\)th draw is a new value sampled from \(F_0\), and with probability \(\frac{n_j}{\alpha+n}\) it repeats one of the previously seen values (specifically, it equals the j-th distinct value seen so far, which occurred \(n_j\) times). This elegant scheme generates clusters of identical values and is the basis of the Chinese restaurant process in machine learning. Importantly, it required no explicit mention of a parameter – the predictive probabilities themselves defined the model. The Dirichlet process became the canonical example of a prior that is constructed via predictive distributions. Around the same time, Cifarelli and Regazzini (1978) in Italy discussed Bayesian nonparametric problems under exchangeability, and Ewens’s sampling formula (Ewens 1972) in population genetics provided another famous predictive rule for random partitions of species. These developments showed the power of de Finetti’s idea: we can build rich new models by directly formulating how observations predict new ones.
1980s – Predictive Inference and Model Assessment: By the 1980s, the predictive viewpoint began influencing statistical practice and philosophy outside of nonparametrics. Seymour Geisser advanced the idea that predictive ability is the ultimate test of a model – he promoted predictive model checking and advocated using the posterior predictive distribution for model assessment and selection (foundational to modern cross-validation approaches). In 1981, Rubin introduced the Bayesian bootstrap, an alternative to the classical bootstrap, which can be seen as a predictive inferential method: it effectively assumes an exchangeable model where the “prior” is that the \(n\) observed data points are a finite population from which future samples are drawn uniformly at random. The Bayesian bootstrap’s posterior predictive for a new observation is simply the empirical distribution of the observed sample (with random weights), which aligns with de Finetti’s view of directly assigning probabilities to future data without a parametric likelihood. Ghosh and Meeden (Ghosh and Meeden 1986; Ghosh 2021) further developed Bayesian predictive methods for finite population sampling, treating the unknown finite population values as exchangeable and focusing on predicting the unseen units – again, no explicit parametric likelihood was needed. These works kept alive the notion that Bayesian inference “a la de Finetti” – with predictions first – could be practically useful. However, at the time, mainstream Bayesian statistics still largely centred on parametric models and priors, so the predictive approach was a somewhat heterodox perspective, championed by a subset of Bayesian thinkers.
1990s – The Italian School and Generalized Exchangeability: The 1990s saw renewed theoretical interest in characterizing exchangeable structures via predictions. Partial exchangeability (where data have subgroup invariances, like Markov exchangeability or other structured dependence) became a focus. In 1995, Jim Pitman generalized the Pólya urn to a two-parameter family (the Pitman–Yor process), broadening the class of predictive rules to capture power-law behavior in frequencies (Pitman 1995). In Italy, scholars like Eugenio Regazzini, Pietro Muliere, and their collaborators began exploring reinforced urn processes and other predictive constructions for more complex sequences. For example, Pietro Muliere and Petrone (1993) applied predictive mixtures of Dirichlet processes in regression problems, and P. Muliere, Secchi, and Walker (2000) introduced reinforced urn models for survival data. These models were essentially Markov chains whose transition probabilities update with reinforcement (i.e. past observations feed back into future transition probabilities), and they showed such sequences are mixtures of Markov chains – a type of partially exchangeable structure. Throughout, the strategy was to start by positing a plausible form for the one-step predictive distribution and then deduce the existence and form of the underlying probability law or “prior.” This reversed the conventional approach: instead of specifying a prior then deriving predictions, we specify predictions and thereby defined an implicit prior. By the end of the 90s, the groundwork was laid for a systematic predictive construction of Bayesian models.
2000s – Predictive Characterizations and New Priors: In 2000, a landmark paper (Fortini, Ladelli, and Regazzini 2000) formalized the conditions for a predictive rule to yield exchangeability. They gave precise necessary and sufficient conditions on a sequence of conditional distributions \((P_n)\) such that there exists some exchangeable joint law \(P\) producing them. In essence, they proved that symmetry (the predictive probabilities depend on data only through symmetric functions like counts) and a certain consistency (related to associative conditioning of future predictions) characterize exchangeability. This result (along with earlier work by Diaconis and Freedman on sufficiency) provided a rigorous predictive criterion: we can validate if a proposed prediction rule is coherent (comes from some exchangeable model) without explicitly constructing the latent parameter. Around the same time, new priors in BNP were being defined via predictive structures. For instance, the species sampling models (Boothby, Pitman, etc.) were recognized as those exchangeable sequences with a general predictive form \(P(X_{n+1}= \text{new} \mid X_{1:n}) = \frac{\alpha + c k}{\alpha + n}\) (for some constants \(c,\alpha\) and \(k\) distinct values so far), which yields various generalizations of the Dirichlet process. The Italian school played a leading role: they worked out how popular nonparametric priors like Dirichlet processes, Pitman–Yor processes, and others can be derived from a sequence of predictive probabilities. Priors by prediction became a theme. Fortini and Petrone (2012) wrote a comprehensive review on predictive construction of priors for both exchangeable and partially exchangeable scenarios. They highlighted theoretical connections and revisited classical results “to shed light on theoretical connections” among predictive constructions. By the end of the 2000s, it was clear that we could either start with a prior or directly with a predictive mechanism – the two routes were provably equivalent if done consistently, but the predictive route often yielded new insights.
2010s – Consolidation and Wider Adoption: In the 2010s, the predictive approach gained broader recognition and was increasingly connected to modern statistical learning. Fortini and Petrone continued to publish a series of works extending the theory: they explored predictive sufficiency (identifying what summary of data preserves all information for predicting new data), and they characterized a range of complex priors via predictive rules (from hierarchical priors to hidden Markov models built on predictive constructions). For example, they showed how an infinite Hidden Markov Model (used in machine learning for clustering time series) can be seen as a mixture of Markov chains, constructed by a sequence of predictive transition distributions. Meanwhile, machine learning researchers, notably in the topic modelling and Bayesian nonparametric clustering communities, adopted the language of exchangeable partitions (the Chinese restaurant process, Indian buffet process, etc., all essentially predictive rules). The review article Fortini and Petrone (2016) distilled the philosophy and noted how the predictive approach had become central both to Bayesian foundations and to practical modelling in nonparametrics and ML. Another development was the exploration of conditionally identically distributed (CID) sequences (weaker than full exchangeability) and other relaxations – these allow some trend or covariate effects while retaining a predictive structure. Researchers like Berti contributed here, defining models where only a subset of predictive probabilities are constrained by symmetry (Berti, Pratelli, and Rigo 2004, 2012). All these efforts reinforced that de Finetti’s perspective is not just philosophical musing – it leads to concrete new models and methods.
2020s – Martingale Posteriors and Prior-Free Bayesianism: Very recent years have witnessed a surge of interest in prior-free or prediction-driven Bayesian updating rules. Two parallel lines of work – one by Fong, Holmes, and Walker in the UK, and another by Berti, Rigo, and collaborators in Italy – have pushed the predictive approach to its logical extreme: conduct Bayesian inference entirely through predictive distributions, with no explicit prior at all. Edwin Fong’s D.Phil. thesis (C. H. E. Fong 2021) and subsequent papers introduced the Martingale Posterior framework. The core idea is to view the “parameter” as the infinite sequence of future (or missing) observations. If we had the entire population or the entire infinite sequence \(Y_{n+1:\infty}\), any parameter of interest (like the true mean, or the underlying distribution \(F\)) would be known exactly. Thus uncertainty about \(\theta\) is really uncertainty about the as-yet-unseen data. Fong et al. formalize this by directly assigning a joint predictive distribution for all future observations given the observed \(Y_{1:n}\). In notation, instead of a posterior \(p(\theta\mid Y_{1:n})\), they consider \(p(Y_{n+1:\infty}\mid Y_{1:n})\). This is a huge distribution (over an infinite sequence), but under exchangeability it encodes the same information as a posterior on \(\theta\). In fact, there is a one-to-one correspondence: if we choose the predictive distribution in the standard Bayesian way (by integrating the likelihood against a prior), then Doob’s martingale theorem implies the induced distribution on \(\theta\) is exactly the usual posterior. Fong and colleagues instead relax this: they allow the user to specify any predictive mechanism (any sequence of one-step-ahead predictive densities) that seems reasonable for the problem, not necessarily derived from a likelihood-prior pair. As long as these predictive densities are coherent (a martingale in the sense of not contradicting themselves over time), we can define an implicit “posterior” for \(\theta\) or for any function of the unseen data. They dub this the martingale posterior distribution, which “returns Bayesian uncertainty directly on any statistic of interest without the need for the likelihood and prior”. In practice, they introduce an algorithm called predictive resampling to draw samples from the martingale posterior. Essentially, we iteratively sample pseudo-future observations from the chosen predictive rule to impute an entire fake “completion” of the data, use that to compute the statistic of interest, and repeat – thereby approximating the distribution of that statistic under the assumed predictive model. Martingale posteriors generalize Bayesian inference, subsuming standard posteriors when the predictive comes from a usual model, but also allowing robust or model-misspecified settings to be handled by choosing an alternative predictive (e.g. we might choose a heavy-tailed predictive density to guard against outliers, implicitly yielding a different “posterior”).
In parallel, Berti et al. (2023) developed a similar idea of Bayesian predictive inference without a prior. They work axiomatically with a user-specified sequence of predictives \(\{\sigma_n(\cdot\mid x_{1:n})\}_{n\ge0}\) and establish general results for consistency and asymptotics of the resulting inference. One main advantage, as they note, is “no prior probability has to be selected” – the only inputs are the data and the predictive rule. These cutting-edge developments show how de Finetti’s viewpoint – once considered philosophically radical – is now driving methodological innovation for large-scale and robust Bayesian analysis. Today, the predictive approach is not only a cornerstone of Bayesian foundations but also an active area of research in its own right, influencing topics from machine learning (e.g. sequence modelling and meta-learning) to the theory of Bayesian asymptotics.
9 Questions
- How do we extend this idea to causal inference, especially causal abstraction? Do we simply end up reinventing classical graphical models anyway?