The Predictive Approach to Bayesian Inference
Purely-predictive models, “The Italian school”, martingale posteriors, …
2025-07-10 — 2025-08-04
Bayesian inference is traditionally introduced via unknown parameters: we place a prior on a parameter \(\theta\) and update it to a posterior given data, then use the posterior over this parameter to generate a posterior over actual observables. The predictive approach puts prediction at the center – probabilities are assigned directly to future observables rather than to parameters. This view was championed by Bruno de Finetti, who argued that probability statements should only refer to observable events, with parameters serving as a convenient fiction linking past data to future outcomes. In other words, the only meaningful uncertainty is about things we might observe. Every parameter is a nuisance parameter, an auxiliary object.
We might care about this minimalism for philosophical or methodological reasons, or we might care because the great success of the age in machine learning (e.g. in neural nets) has been predicting observables rather than trying to tie those predictions to “true parameters”. So maybe we need to think about Bayes in that context too?
Indeed, after watching a seminar by Susan Wei, I feel that this might be an interesting way of understanding when foundation models do optimal inference; most neural networks are best understood as purely predictive models rather than parameter estimators, so purely predictive Bayes suddenly seems like it might be a useful analogy.
1 Background and Notation
Throughout we need to distinguish distributions (uppercase letters) from their densities (lowercase letters) where both exist. We work on an infinite sequence of observations
\[ X_1,\,X_2,\,\dots \]
taking values in some space \(\mathcal X\). A finite batch of data is denoted
\[ x_{1:n} \;=\; (x_1,\dots,x_n). \]
1.1 Classic, Parameter-Based, Bayesian Inference
Parameter: \(\theta\in\Theta\).
Prior distribution on \(\theta\): \(\Pi(d\theta)\), with density \(\pi(\theta)\).
Likelihood of data \(x_{1:n}\) under parameter \(\theta\):
\[ L(x_{1:n}\mid \theta) \quad\bigl(\text{often written }p(x_{1:n}\mid \theta)\bigr). \]
Posterior distribution of \(\theta\):
\[ \Pi(d\theta\mid x_{1:n}), \quad\text{with density } \pi(\theta\mid x_{1:n}) = \frac{L(x_{1:n}\mid \theta)\,\pi(\theta)}{\int_\Theta L(x_{1:n}\mid \vartheta)\,\pi(\vartheta)\,d\vartheta}. \]
Posterior predictive for a new observation \(X_{n+1}\):
\[ P_{\text{param}}\bigl(X_{n+1}\in A\mid x_{1:n}\bigr) = \int_\Theta P_\theta(X_{n+1}\in A)\;\Pi(d\theta\mid x_{1:n}), \]
or in density form
\[ p(x_{n+1}\mid x_{1:n}) = \int_\Theta p(x_{n+1}\mid \theta)\,\pi(\theta\mid x_{1:n})\,d\theta. \]
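To make the parametric route concrete, here is a minimal sketch (in Python, with function names of my own choosing) of the Beta–Bernoulli case: the posterior predictive can be obtained either by numerically integrating \(p(x_{n+1}\mid\theta)\) against the posterior density, or in closed form as the posterior mean \((a+k)/(a+b+n)\).

```python
import numpy as np
from scipy import integrate, stats

def posterior_predictive_heads(x, a=1.0, b=1.0):
    """P(X_{n+1} = 1 | x_{1:n}) under a Beta(a, b)-Bernoulli model, two ways."""
    n, k = len(x), int(np.sum(x))
    # Closed form: the posterior is Beta(a + k, b + n - k); the predictive
    # probability of heads is its mean.
    closed = (a + k) / (a + b + n)
    # Same thing by numerically integrating p(x_{n+1} = 1 | theta) = theta
    # against the posterior density.
    post = stats.beta(a + k, b + n - k)
    numeric, _ = integrate.quad(lambda th: th * post.pdf(th), 0.0, 1.0)
    return closed, numeric

x = np.array([1, 0, 1, 1, 0, 1])          # toy data: 4 heads in 6 tosses
print(posterior_predictive_heads(x))      # both ~ (1 + 4) / (2 + 6) = 0.625
```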
2 Predictive (de Finetti–Style) Inference
Rather than introduce an unobserved \(\theta\), we work directly with the sequence of one-step-ahead predictive distributions:
\[ P_n(\,\cdot\mid x_{1:n}) \equiv \Pr\bigl\{X_{n+1}\in\cdot \,\bigm|\,X_{1:n}=x_{1:n}\bigr\}, \qquad n=0,1,2,\dots, \]
where:
- \(P_0\) is the prior predictive (our belief about \(X_1\) before seeing data).
- \(P_n\) is the predictive rule after seeing \(x_{1:n}\).
If these \(P_n\) satisfy coherence and exchangeability conditions (resp. Kolmogorov consistency, symmetry in \(x_{1:n}\)), then by de Finetti’s theorem there exists an (implicit) random measure \(F\) so that
\[ P_n(A\mid x_{1:n}) = \mathbb{E}\bigl[F(A)\mid x_{1:n}\bigr]. \]
The \(P_n\) are our Bayesian updates, but in principle we do not need to write down—or sample—the posterior on \(F\), because there is an existence proof that such updates exist. Confusingly, many intros assume this will be sufficient to satisfy us! And yet, I am not satisfied. How do I calculate such things?
2.1 Key notation for predictives
| Symbol | Meaning |
|---|---|
| \(P_n(\cdot\mid x_{1:n})\) | Predictive distribution for \(X_{n+1}\) given data \(x_{1:n}\). |
| \(p_n(x_{n+1}\mid x_{1:n})\) | Density of \(P_n\) when it exists (lowercase for densities). |
| \(\tilde F\) | Implicit random probability measure (“parameter” in de Finetti’s sense). |
| \(\Pi(dF)\) | Mixing measure on \(\tilde F\) (the de Finetti prior), usually never written out. |
| \(\alpha,F_0\) | Hyperparameters in Dirichlet-process examples: \(\alpha\) is concentration, \(F_0\) the base. |
- The parameter view is familiar to more people: we write down \(\pi(\theta)\) and \(p(x\mid\theta)\), update to \(\pi(\theta\mid x)\), then integrate out \(\theta\) to predict.
- The predictive view flips that: we directly specify how we would predict the next datum at each step. If our predictive rule is coherent, there automatically exists some Bayes model behind it.
For many modern nonparametric and robust methods, specifying the \(P_n\) directly can be easier than choosing a high-dimensional prior. In some cases (e.g. Fortini–Petrone’s predictive resampling or Fong et al.’s martingale posteriors) we never explicitly deal with \(\tilde F\) or \(\theta\) at all.
3 Practical implementation
You will notice that most of the theory below consists of elaborate and quite painful proofs of symmetry properties for exponential-family models – interesting, but not exactly state-of-the-art world models. You might ask: can we actually compute predictive Bayes in practice? Would this be helpful for building intuition?
Yes, it turns out there are a couple of useful models. In particular, Bayesian in-context learning via transformers etc. seems to be a pretty good model for a pure Bayes predictive. Hollmann et al. (2023), a tabular-data inference method, and Lee et al. (2023), a neural-process regressor, both make this connection explicit and are great to play with. I find this massively helpful in building intuitions, as opposed to, say, starting by proving theorems about factorisations of measures under exchangeability.
4 How it works
4.1 Exchangeability and de Finetti’s Theorem
Exchangeability is the assumption that our probabilistic beliefs about a sequence of random variables do not depend on the order in which data are observed. Formally, \((X_1,\ldots,X_n)\) is exchangeable if for every permutation \(\sigma\) of \(\{1,\dots,n\}\), the joint distribution satisfies \(P(X_1\in dx_1,\dots,X_n\in dx_n) = P(X_{\sigma(1)}\in dx_1,\dots,X_{\sigma(n)}\in dx_n)\). De Finetti’s representation theorem is the foundation: any infinite exchangeable sequence is a conditionally iid sequence. There exists a (generally unknown) random distribution \(F\) such that, given \(F\), the \(X_i\) are iid with law \(F\). In Bayesian terms, \(F\) plays the role of a parameter – in fact we can think of \(F\) as \(\theta\) and put a prior on it. In measure-theoretic form: if \((X_n)_{n\ge1}\) is exchangeable on a space \(\mathcal{X}\), then there exists a probability measure \(\mu\) on the set of probability distributions over \(\mathcal{X}\) such that for every \(n\) and any measurable events \(A_1,\dots,A_n\subseteq\mathcal{X}\),
\[ P(X_1\in A_1,\dots,X_n\in A_n) \;=\; \int \prod_{i=1}^n F(A_i)\; \mu(dF)\,. \]
\(\mu\) is the de Finetti mixing measure, or the law of the random \(F\). If \(\mu\) were known, we could sample \(F\sim \mu\) (this is like drawing a random parameter) and then sample \(X_1,\dots,X_n \overset{\text{iid}}{\sim} F\). Of course \(\mu\) is typically not known – it is merely asserted to exist by the theorem.
The joint distribution factors into a product of predictive terms. As noted earlier, \(P(x_1,\dots,x_n) = \int \prod_{i=1}^n F(dx_i)\,\mu(dF)\). By Fubini’s theorem, we can swap integration order to see this as an iterative conditioning: first draw \(F\sim\mu\), then \(X_1\sim F\); next draw \(X_2\sim F\), etc. The result is exactly \(P(x_1)P(x_2|x_1)\cdots P(x_n|x_{1:n-1})\). In fact, the one-step predictive distribution emerges naturally: \(P(X_{n+1}\in A \mid X_{1:n}=x_{1:n}) \;=\; \int F(A)\; \mu(dF\mid X_{1:n}=x_{1:n})\,,\) where \(\mu(\cdot\mid X_{1:n}=x_{1:n})\) is the posterior distribution of \(F\) given the observed data. This formula formalizes a simple idea: the predictive distribution for a new observation is the posterior mean of \(F\) (the “urn” distribution). For example, in a coin-flip scenario, \(F\) would be the true (but unknown) distribution of Heads/Tails; given some data, the predictive probability of Heads on the next flip is the posterior expected value of \(F(\{\text{Heads}\})\). If we had a \(\operatorname{Beta}(a,b)\) prior on the coin’s bias \(p=F(\{\text{Heads}\})\), this predictive is \(\frac{a + \text{(heads so far)}}{a+b+n}\) – the well-known Beta-Binomial rule. Taking \(a=b=1\) (a uniform prior), this gives \((1+k)/(2+n)\) after \(k\) heads in \(n\) tosses, which is Laplace’s “rule of succession.” Notice the predictive viewpoint reproduces such formulas in this case: it is literally the posterior expectation of the unknown frequency.
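Here is a small numerical check of that factorisation, under the Beta–Bernoulli assumptions of the coin example (function names are mine): the joint probability computed as the de Finetti mixture integral agrees with the product of one-step Laplace-style predictives.

```python
import numpy as np
from scipy import integrate
from scipy.stats import beta

def joint_via_mixture(x, a=1.0, b=1.0):
    """P(x_{1:n}) as the de Finetti mixture: integral of p^k (1-p)^{n-k} against Beta(a,b)."""
    k, n = int(np.sum(x)), len(x)
    val, _ = integrate.quad(
        lambda p: p**k * (1 - p)**(n - k) * beta(a, b).pdf(p), 0.0, 1.0)
    return val

def joint_via_predictives(x, a=1.0, b=1.0):
    """P(x_{1:n}) as the product of one-step predictives P(x_i | x_{1:i-1})."""
    prob, heads, n = 1.0, 0, 0
    for xi in x:
        p_head = (a + heads) / (a + b + n)   # Laplace-style predictive for 'heads'
        prob *= p_head if xi == 1 else (1 - p_head)
        heads += xi
        n += 1
    return prob

x = [1, 0, 1, 1, 0]
print(joint_via_mixture(x), joint_via_predictives(x))   # both ~ 1/60
```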
Learning is reflected in the convergence of predictive distributions. Under exchangeability, the Strong Law of Large Numbers implies the empirical distribution of observations converges to the underlying \(F\) almost surely. In predictive terms, this means our predictive distribution eventually concentrates. A precise statement is: with probability 1, \(P(X_{n+1}\in \cdot \mid X_{1:n})\) converges (as \(n\to\infty\)) to the limiting empirical distribution \(F\); equivalently, the posterior over \(F\) becomes degenerate, putting all its mass on that limit. In other words, as we see more data, the sequence of predictive measures \((P_n)\) forms a martingale that converges to the true distribution \(F\) almost surely. This is a powerful (main?) consistency property: no matter what the true \(F\) is, an exchangeable Bayesian will (almost surely) eventually have predictive probabilities that match \(F\) on every set \(A\). In classical terms, the posterior on \(F\) converges to a point mass at the truth (if the model is well-specified). This was first shown by Joseph Doob in 1949 using martingale theory. It provides a frequentist validation of Bayesian learning in the exchangeable case. From the predictive perspective, it says that if our predictive rule is coherent and we eventually see enough data, we will effectively discover the data-generating distribution – our forecasts become indistinguishable from frequencies in the long run.
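A tiny simulation of that convergence, using the Laplace rule-of-succession predictive from the coin example (the true bias 0.7 is my arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
p_true = 0.7
x = rng.binomial(1, p_true, size=5000)   # 'true' coin, unknown to the learner

heads = np.cumsum(x)
n = np.arange(1, len(x) + 1)
pred = (1 + heads) / (2 + n)             # Laplace rule-of-succession predictive

# The predictive probability of heads converges to the true frequency.
for m in (10, 100, 1000, 5000):
    print(m, pred[m - 1])
```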
As a time-series guy, one thing I found weird about this literature, which I will flag for you, dear reader, is that essentially everything here looks like it should extend to predicting arbitrary phenomena, but different schools introduce different assumptions on the generating process, and it is not easy to back out which, at least not for me. Sometimes we are handling complicated dependencies, other times not. AFAICT ‘predictive Bayes’ as such doesn’t impose much, but de Finetti’s actual results about exchangeability essentially impose conditionally-i.i.d. structure on the observations.
4.2 Predictive Distributions and Coherence Conditions
A predictive rule or predictive distribution sequence is a collection \(\{P_n(\cdot \mid x_{1:n}) : n\ge0,\; x_{1:n}\in \mathcal{X}^n\}\) where each \(P_n(\cdot\mid x_{1:n})\) is a probability measure for the next observation \(X_{n+1}\) given past observations \(x_{1:n}\). (For \(n=0\), we have \(P_0(\cdot)\), which is just the distribution of \(X_1\) before seeing any data.) Not every arbitrary choice of predictive rules corresponds to a valid joint distribution — they must satisfy coherence constraints. The fundamental coherence condition is that there exists some joint probability law \(P\) on the sequence such that for all \(n\) and events \(A\subseteq\mathcal{X}\):
\[ P_n(A \mid X_{1:n}=x_{1:n}) \;=\; P(X_{n+1}\in A \mid X_{1:n}=x_{1:n}), \]
i.e. the kernel \(P_n\) truly comes from that joint law. By the Ionescu–Tulcea extension theorem, if we can specify a family of conditional probabilities in a consistent way for every finite stage, a unique process measure \(P\) on the sequence (on \(\mathcal{X}^\infty\)) is induced. Thus, one route to define a Bayesian model is: pick a predictive rule and invoke Ionescu–Tulcea to get the joint. However, ensuring consistency and exchangeability in the predictive rule is weird; we have a bunch of existence proofs, but they are not obviously constructive. Fortini, Ladelli, and Regazzini (2000) gave necessary and sufficient conditions for exchangeability in terms of predictives. In plain language, their conditions are:
Symmetry: If we condition on any past data \(x_{1:n}\), the predictive distribution \(P_n(\cdot \mid x_{1:n})\) must be a symmetric function of \(x_{1:n}\). That is, it depends on the past observations only through their multiset or empirical frequency. This is intuitively clear – if the order of past data doesn’t matter for joint probabilities, it shouldn’t matter for predicting the next observation either. So, for example, in an exchangeable coin toss model, \(P(X_{n+1}=\text{Heads}\mid X_{1:n})\) can only depend on the number of heads in \(n\) tosses (and \(n\)), not on the exact sequence.
Associativity / Consistency: This one is more technical. Roughly it means the predictive rule must be internally consistent when extended to two steps. Formally, for any \(n\ge0\), for any events \(A,B\) in the state space, we require
\[ \int_A P_{n+1}(B \mid x_{1:n}, x_{n+1}) \; P_n(dx_{n+1} \mid x_{1:n}) \;=\; \int_B P_{n+1}(A \mid x_{1:n}, x_{n+1}) \; P_n(dx_{n+1} \mid x_{1:n})\,, \]
and this should hold for all past \(x_{1:n}\). Although this expression looks complicated, it is basically saying: if we consider two future time points \(n+1\) and \(n+2\), the joint predictive for \((X_{n+1}, X_{n+2})\) should be symmetric in those two (because the entire sequence is exchangeable). It ensures that our one-step predictive extends consistently to a two-step (and hence multi-step) predictive. In intuitive terms, condition (ii) is related to Kolmogorov consistency for the projective family of predictive distributions and the requirement of exchangeability on future samples. If symmetry (i) holds and this consistency (ii) holds, then there exists an exchangeable joint law producing those predictives. These are the predictive analog of de Finetti’s theorem: they characterize when a given predictive specification is “valid” (arises from some mixture model).
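As a sanity check on what these conditions buy us, here is a toy numerical verification (my own construction, with an arbitrarily chosen discrete base measure and concentration) that a Pólya-urn-style predictive rule yields order-invariant joint probabilities when the joint is built as a product of one-step predictives.

```python
from itertools import permutations

ALPHA = 2.0                      # arbitrary concentration (assumption)
F0 = {0: 0.7, 1: 0.3}            # arbitrary discrete base measure (assumption)

def predictive(v, past):
    """Polya-urn-style rule: (alpha * F0(v) + #{past == v}) / (alpha + n)."""
    return (ALPHA * F0[v] + sum(1 for z in past if z == v)) / (ALPHA + len(past))

def joint(seq):
    """Joint probability of a sequence as the product of one-step predictives."""
    prob = 1.0
    for i, v in enumerate(seq):
        prob *= predictive(v, seq[:i])
    return prob

seq = (1, 0, 0, 1, 1)
# Every reordering gives the same joint probability: the rule is exchangeable.
print({round(joint(p), 12) for p in permutations(seq)})
```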
The takeaway is that to design a Bayesian model in predictive form, we must propose a rule \(P_n(\cdot\mid x_{1:n})\) that is symmetric and coherent across time. If we do have such a rule, either by being classic parametric, or by some spicy extension, a de Finetti-type parameter \(\Theta\) (the random distribution \(F\)) is implicitly defined as the thing whose posterior predictive matches our rule.
Note, we get that for free by just doing classic-flavoured parametric Bayes and assuming that our likelihood is correct. The extra work here is to make that more general.
The predictive approach often yields an implicit description of the prior on \(\Theta\) without having to write it down explicitly; the Pólya urn scheme discussed later is the classic example.
4.3 Parametric Models and Sufficient Statistics
The predictive approach is fully compatible with parametric Bayesian models as well. If we have a parametric family \(\{f_\theta(x)\}\) and a prior \(\pi(d\theta)\), the traditional posterior predictive is \(P(X_{n+1}\in A \mid X_{1:n}=x_{1:n}) = \int_{\Theta} P_\theta(X_{n+1}\in A)\; \pi(d\theta \mid X_{1:n}=x_{1:n})\,.\) This will obey the coherence conditions automatically (since it came from a joint model). But an interesting viewpoint is gained by focusing on predictive sufficiency. In many models, the posterior predictive depends on the data only through some summary \(T(x_{1:n})\). For example, in the Beta–Binomial model (coin flips with Beta prior), \(T\) can be the number of heads \(k\) in \(n\) flips. In a normal model with known variance and unknown mean, \(T\) could be the sample mean. In general, if \(T(x_{1:n})\) is a sufficient statistic for \(\theta\), then the predictive distribution \(P_n(\cdot\mid x_{1:n})\) will be a function of \(T(x_{1:n})\). Fortini, Ladelli, and Regazzini (2000) discuss how predictive sufficiency relates to classical sufficiency. Essentially, a statistic \(T_n = T(X_{1:n})\) is predictively sufficient if \(P_n(X_{n+1}\in \cdot \mid X_{1:n}) = P_n(X_{n+1}\in \cdot \mid T_n)\) for all data – that is, the prediction for a new observation depends on past data only through \(T_n\). For an exponential family with conjugate prior, this \(T_n\) will usually be the same as the conventional sufficient statistic. Predictive sufficiency provides another angle: instead of deriving the posterior for \(\theta\), we can directly derive the predictive by updating \(T\). For example, in the normal-with-known-variance case, we can derive the Gaussian form of the predictive for the next observation by directly updating the mean and uncertainty of the sampling distribution, rather than explicitly updating a prior for \(\theta\). In summary, parametric Bayesian models yield predictive rules of a special form (ones that admit a finite-dimensional sufficient statistic). These are consistent with the general theory but represent cases where the predictive distributions lie on a lower-dimensional family. The predictive approach doesn’t conflict with parametric modelling – it generalizes it, and in practice it often yields the same results (since Bayes’ theorem ensures coherence). What it offers is a different interpretation of those results: our posterior is just a device to produce predictions, and we could have potentially produced those predictions without ever introducing \(\theta\) explicitly.
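A minimal sketch of predictive sufficiency in the normal-known-variance case, assuming a conjugate \(N(m_0, s_0^2)\) prior on the mean (the constants and names are mine): the posterior predictive is computed entirely from \((\bar x, n)\), never from the raw data.

```python
import numpy as np

def predictive_from_summary(xbar, n, sigma=1.0, m0=0.0, s0=10.0):
    """Posterior predictive N(m_n, s_n^2 + sigma^2) for X ~ N(theta, sigma^2)
    with a N(m0, s0^2) prior on theta; needs the data only through (xbar, n)."""
    post_prec = 1.0 / s0**2 + n / sigma**2          # posterior precision of theta
    s_n2 = 1.0 / post_prec
    m_n = s_n2 * (m0 / s0**2 + n * xbar / sigma**2)
    return m_n, np.sqrt(s_n2 + sigma**2)            # predictive mean and sd

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=50)
# Any reordering or re-batching of x gives the same answer: (xbar, n) is
# predictively sufficient here.
print(predictive_from_summary(x.mean(), len(x)))
```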
4.4 Martingales and the Martingale Posterior
One of the modern breakthroughs in predictive inference is recognizing the role of martingales in Bayesian updating. We saw earlier that the sequence of predictive distributions \((P_n)\) under an exchangeable model is a martingale (in the weak sense that \(P_{n}\) is the conditional expectation of \(P_{n+1}\) given current information) and it converges to the true distribution \(F\) almost surely. E. Fong, Holmes, and Walker (2021) took this insight further: if the conventional Bayes posterior predictive yields a martingale that converges to a degenerate distribution at the true \(F\), perhaps we can construct other martingales that converge to something else, representing different uncertainty quantification. The martingale posterior is built on this idea. They start from the premise: “the foundation of Bayesian inference is to assign a distribution on missing observations conditional on what has been observed.” Rather than begin with a prior on \(\theta\), they ask: what distribution would a Bayesian assign to all the not-yet-seen data \(Y_{n+1},Y_{n+2},\dots\) given the seen \(Y_{1:n}\)? If we follow standard Bayes, the answer is clear: it would be the product of posterior predictives, i.e. \(P(Y_{n+1:\infty}\in dy_{n+1:\infty}\mid Y_{1:n}=y_{1:n}) = \int \prod_{m=n+1}^\infty F(dy_m)\; \mu(dF \mid Y_{1:n}=y_{1:n})\). This complicated object is basically “\(F\) drawn from the posterior, then future \(Y\) iid from \(F\).” From it, we can derive the usual posterior for any parameter \(\theta = T(F)\) as the induced distribution of \(\theta\). In fact, Doob’s theorem showed that if we go this route, the distribution of \(\theta\) defined in this way equals the standard posterior for \(\theta\). So nothing new so far – it just recapitulates that the Bayesian posterior is equivalent to the Bayesian predictive for the sequence. But now comes the twist: we are free to choose a different joint predictive for the future that is not derived from the prior predictive formula. If we do so, we get a different “posterior” for \(\theta\) when we invert that relationship. The only requirement is that the chosen joint predictive for \(Y_{n+1:\infty}\) given \(Y_{1:n}\) satisfies a martingale condition as \(n\) grows (so that as \(n\to\infty\) it collapses appropriately). Fong et al. introduce a family of such predictive rules (including some based on copula dependence structures to handle when observations are not iid). The resulting martingale posterior is the probability distribution on the parameter (or on any estimand of interest) that corresponds to these new predictives. It is a true posterior in the sense that it is an updated distribution on the quantity given the data, but it need not come from any prior via Bayes’ rule. In practice, we can think of it as a Bayesian-like posterior that uses an alternative updating rule. The martingale posterior is chosen to satisfy certain optimality or robustness properties (for instance, we can design it to minimize some loss on predictions, or to coincide with bootstrap resampling in large samples).
Fong et al. note connections between martingale posteriors and the Bayesian bootstrap and other resampling schemes. The term “martingale” highlights that the sequence of these posteriors (as \(n\) increases) is a martingale in the space of probability measures, which ensures coherence over time. This concept is quite new, and ongoing research is examining its properties – for example, how close is a martingale posterior to the ordinary posterior as \(n\) becomes large? Early results indicate that if the model is well-specified, certain martingale posteriors are consistent (converge on the true parameter) and even asymptotically normal like a standard posterior. But if the model is misspecified, a cleverly chosen martingale posterior might offer advantages in terms of robustness (since it was not derived from the wrong likelihood). The theory here is deep, combining Bayesian ideas with martingale convergence and requiring new technical tools to verify things like convergence in distribution of these posterior-like objects. From a user’s perspective, however, the appeal is clear: we can get a posterior without specifying a prior or likelihood, by directly coding how we think future data should behave. This significantly lowers the barrier to doing Bayesian-style uncertainty quantification in complex problems – we can bypass the often difficult task of choosing a prior or fully specifying a likelihood and focus on predictions (which may be easier to elicit from experts or justify scientifically).
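For intuition, here is a crude predictive-resampling sketch in the spirit of these ideas (my own simplification, not Fong et al.’s actual algorithms): it uses the \(\alpha\to 0\) Pólya-urn (empirical) predictive, forward-simulates a long pseudo-future, and records the mean of each completed sequence. The resulting draws behave much like a Bayesian bootstrap posterior for the mean.

```python
import numpy as np

def martingale_posterior_mean(x, n_future=2000, n_draws=500, seed=0):
    """Predictive resampling with the alpha -> 0 Polya-urn (empirical) predictive:
    forward-simulate a long pseudo-future, then record the mean of each completed
    sequence. The draws approximate an uncertainty distribution for the mean."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        pool = list(x)                                   # observed + imputed so far
        for _ in range(n_future):
            pool.append(pool[rng.integers(len(pool))])   # next obs ~ empirical predictive
        draws.append(np.mean(pool))
    return np.array(draws)

x = np.random.default_rng(1).normal(0.5, 1.0, size=30)
samples = martingale_posterior_mean(x)
print(samples.mean(), np.quantile(samples, [0.025, 0.975]))
```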
5 Summary of Theoretical Insights
The predictive framework rests on a few key insights proven by the above results:
Observables suffice: If we can specify our beliefs about observable sequences (one-step-ahead at a time) in a consistent way, we have done the essential job of modelling. Theorems like de Finetti’s and its predictive extensions guarantee that a parameter-based description exists if we want it, but it’s optional. “Bayes à la de Finetti” means one-step probabilities are the building blocks.
Exchangeability is a powerful symmetry: It grants a form of “sufficientness” to empirical frequencies. For instance, in an exchangeable model, the predictive distribution for tomorrow given all past data depends only on the distribution of past data, not on their order. This leads to natural Bayesian consistency (learning from frequencies) and justifies why we often reduce data to summary statistics.
Predictive characterization of models: Many complex models (like BNP priors or hierarchical mixture models) can be characterized by their predictive rules. In some cases this yields simpler derivations. For example, it’s easier to verify an urn scheme than to prove a Chinese restaurant process formula from scratch. Predictive characterizations also allow extending models by modifying predictive rules (e.g. creating new priors via new urn schemes, such as adding reinforcement or memory).
Philosophical clarity: The predictive view clarifies what the “parameter” really is – usually some aspect of an imaginary infinite population. As Fong et al. put it, “the parameter of interest [is] known precisely given the entire population”. This demystifies \(\theta\): it’s not a mystical quantity but simply a function of all unseen data. Thus, debating whether \(\theta\) “exists” is moot – what exists are data (observed or not yet observed). This philosophy can be very practical: it encourages us to check our models by simulating future data (since the model is the prediction rule), and to judge success by calibration of predictions.
Flexibility and robustness: By freeing ourselves from always specifying a likelihood, we can create “posterior-like” updates that may be more robust. For instance, we could specify heavier-tailed predictive densities than a Gaussian model would give, to reduce sensitivity to outliers, and still have a coherent updating scheme that quantifies uncertainty. This is one motivation behind general Bayesian approaches and martingale posteriors.
6 Practical Methods and Examples
Let’s walk through a few concrete examples and methods where the predictive approach is applied. We’ve already discussed the Pólya urn (Dirichlet process) and a simple Beta-Bernoulli model. Here we highlight additional applied scenarios:
In-context learning: as we presaged above, there are neural networks that give up on the proofs and just compute interesting Bayes predictive updates. See (Hollmann et al. 2023; Lee et al. 2023).
Bayesian Bootstrap (Rubin 1981): Suppose we have observed data \(x_1,\dots,x_n\) which we treat as a sample from some population distribution \(F\). The Bayesian bootstrap avoids choosing a parametric likelihood for \(F\) and instead puts a uniform prior on the space of all discrete distributions supported on \(\{x_1,\dots,x_n\}\). In effect, we assume exchangeability and that the true \(F\) puts all its mass on the observed points (as would be the case if these \(n\) points were literally the entire population values, but we just don’t know their weights). The posterior for \(F\) given the data then turns out to be a Dirichlet(\(1,1,\dots,1\)) on the point masses at \(x_1,\dots,x_n\). Consequently, the posterior predictive for a new observation \(X_{n+1}\) is \(P(X_{n+1}=x_i \mid X_{1:n}=x_{1:n}) = \frac{1}{n}\,,\) for each \(i=1,\dots,n\). In other words, the next observation is equally likely to be any of the observed values. This is precisely the Bayesian bootstrap’s predictive distribution, which is just the empirical distribution of the sample (sometimes one extra point is allowed to be new, but under the flat prior a new point gets zero posterior mass). The Bayesian bootstrap can be viewed as a limiting case of the Dirichlet process predictive rule when \(\alpha \to 0\) (or, dually, when I say “I believe all probability mass is already in the observed points”). It’s a prime example of how a predictive assumption (the next point is equally likely to be any observed point) leads to an implicit prior (Dirichlet(1,…,1) on weights) and a posterior (the random weights after a Dirichlet draw). The Bayesian bootstrap is often used to generate approximate posterior samples for parameters like the mean or quantiles without having to assume any specific data distribution. This method has gained popularity in Bayesian data analysis, especially in cases where a parametric model is hard to justify. It is an embodiment of de Finetti’s idea: we directly express belief about future draws (they should resemble past draws in distribution) and that is our model.
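A minimal implementation sketch of the Bayesian bootstrap for the mean (names and toy data are mine): draw Dirichlet\((1,\dots,1)\) weights over the observed points and evaluate the weighted statistic under each draw.

```python
import numpy as np

def bayesian_bootstrap_mean(x, n_draws=4000, seed=0):
    """Bayesian bootstrap for the mean: draw Dirichlet(1, ..., 1) weights over
    the observed points and evaluate the weighted mean under each draw."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    w = rng.dirichlet(np.ones(len(x)), size=n_draws)    # posterior weights on the x_i
    return w @ x                                        # one weighted mean per draw

x = np.random.default_rng(2).exponential(scale=2.0, size=40)
draws = bayesian_bootstrap_mean(x)
print(draws.mean(), np.quantile(draws, [0.025, 0.975]))  # 'posterior' for the mean
```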
Predictive Model Checking and Cross-Validation: In Bayesian model evaluation, a common practice is to use the posterior predictive distribution to check goodness-of-fit: simulate replicated data \(X_i^{\text{rep}}\) from \(P(X_i^{\text{rep}}\mid X_{1:n})\) and compare to observed \(X_i\). Any systematic difference may indicate model misfit. This is fundamentally a predictive approach: rather than testing hypotheses about parameters, we ask “does the model predict new data that look like the data we have?”. It aligns perfectly with the predictive view that the ultimate goal of inference is accurate prediction. In fact, modern Bayesian workflow encourages predictive checks at every step. Additionally, methods like leave-one-out cross-validation (LOO-CV) can be given a Bayesian justification via the predictive approach. The LOO-CV score is essentially the product of \(P(X_i \mid X_{-i})\) over all \(i\) (the probability of each left-out point under the predictive based on the rest). Selecting models by maximizing this score (or its logarithm) is equivalent to maximizing predictive fit. Some recent research (including by Fong and others) formally connects cross-validation to Bayesian marginal likelihood and even proposes cumulative cross-validation as a way to score models coherently. The philosophy is: a model is good if it predicts well, not just if it has high posterior probability a priori. By building model assessment on predictive distributions, we ensure the evaluation criteria align with the end-use of the model (prediction or forecasting).
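As a toy illustration (my own example), the LOO-CV score for a Beta–Bernoulli model can be computed from the closed-form leave-one-out predictives:

```python
import numpy as np

def log_loo_score_bernoulli(x, a=1.0, b=1.0):
    """Sum of log P(x_i | x_{-i}) under a Beta(a, b)-Bernoulli model, using the
    closed-form leave-one-out predictive (a + k_{-i}) / (a + b + n - 1)."""
    x = np.asarray(x)
    n, k = len(x), int(x.sum())
    score = 0.0
    for xi in x:
        p_head = (a + k - xi) / (a + b + n - 1)   # predictive for the held-out point
        score += np.log(p_head if xi == 1 else 1.0 - p_head)
    return score

x = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
print(log_loo_score_bernoulli(x))   # higher (less negative) = better predictive fit
```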
Coresets and Large-Scale Bayesian Summarization: (Flores 2025) A very recent application of predictive thinking is in creating Bayesian coreset algorithms – these aim to compress large datasets into small weighted subsets that yield almost the same posterior inference. Traditionally, coreset construction tries to approximate the log-likelihood of the full data by a weighted log-likelihood of a subset (minimizing a KL divergence). However, this fails for complex non-iid models. Flores (2025) proposed to use a predictive coreset: choose a subset of points such that the posterior predictive distribution of the subset is close to that of the full data. In other words, rather than matching likelihoods, match how well the subset can predict new data like the full set would. This approach explicitly cites the predictive view of inference (E. Fong and Yiu 2024; Fortini and Petrone 2012) as inspiration. The result is an algorithm that works even for models where likelihoods are intractable (because it can operate on predictive draws). This is a cutting-edge example of methodological innovation driven by predictive thinking.
Machine Learning and Sequence Modelling: It’s worth noting that in machine learning, modern large models (like transformers) are often trained to do next-token prediction on sequences. In some recent conceptual work, researchers have drawn a connection between such pre-trained sequence models and de Finetti’s theory. Essentially, a large language model that’s been trained on tons of text is implicitly representing a predictive distribution for words given preceding words. If the data (text) were regarded as exchangeable in some blocks, the model is doing a kind of empirical Bayes (using the training corpus as prior experience) to predict new text. Some authors (Ye and Namkoong 2024) have even argued that in-context learning by these models is equivalent to Bayesian updating on latent features, “Bayesian inference à la de Finetti”. While these ideas are still speculative, they illustrate how the predictive perspective resonates in ML: the focus is entirely on \(P(\text{future tokens}\mid \text{past tokens})\). If we were to build an AI that learns like a Bayesian, it might well do so by honing its predictive distribution through experience, rather than by explicitly maintaining a distribution on parameters. This is essentially what these sequence models do, albeit not in a fully coherent probabilistic way. See (Hollmann et al. 2023; Lee et al. 2023) for concrete applications.
7 History
I got an LLM to summarize the history for me:
8 Historical Timeline of the Predictive Framework
1930s – de Finetti’s Foundation: In 1937, Bruno de Finetti published his famous representation theorem for exchangeable sequences, laying the cornerstone of predictive Bayesian inference. An infinite sequence of observations \(X_1,X_2,\dots\) is exchangeable if its joint probability is invariant under permutation of indices. De Finetti’s theorem states that any infinite exchangeable sequence is equivalent to iid sampling from some latent random probability distribution \(F\); for suitable observables (say \(X_i\in\mathbb{R}\)):
\[ P(X_1\in dx_1,\dots,X_n\in dx_n) \;=\; \int \prod_{i=1}^n F(dx_i)\; \mu(dF)\,, \]
where \(\mu\) is a “mixing” measure on distribution functions \(F\) (this \(\mu\) serves as a prior over \(F\) in Bayesian terms). Intuitively, if we believe the \(X_i\) are exchangeable, we act as if there is some unknown true distribution \(F\) governing them; given \(F\), the data are iid. De Finetti emphasized that \(F\) itself is an unobservable construct – what matters are the predictive probabilities \(P(X_{n+1}\in A \mid X_{1:n}=x_{1:n})\) for future observations. His philosophical stance was that probability is about our belief in future observable events, not about abstract parameters. He often illustrated this with betting and forecasting interpretations, effectively treating inference as an updating of predictive “previsions” (expectations of future quantities). De Finetti’s ideas formed the philosophical bedrock of the Italian school of subjective Bayesianism, shifting focus toward prediction.
1950s – Formalization of Exchangeability: Following de Finetti, mathematical statisticians solidified the theoretical underpinnings. Hewitt and Savage (1955) provided a rigorous existence proof for de Finetti’s representation via measure-theoretic extension theorems (ensuring a mixing measure \(\mu\) exists for any exchangeable law). This period established exchangeability as a fundamental concept in Bayesian theory. Simply put, exchangeability = “iid given some \(F\)”. This result, sometimes called de Finetti’s theorem, became a “cornerstone of modern Bayesian theory”. It means that specifying a prior on the parameter (or on \(F\)) is mathematically equivalent to specifying a predictive rule for the sequence. In fact, we can recover de Finetti’s mixture form by multiplying one-step-ahead predictive probabilities:
\[ P(x_1,\dots,x_n) = P(x_1) \, P(x_2\mid x_1)\cdots P(x_n\mid x_{1:n-1})\,, \]
and for an exchangeable model this product must equal the integral above. This insight – that a joint distribution can be factorized into sequential predictive distributions – is central to the predictive approach.
1970s – Bayesian Nonparametrics and Urn Schemes: Decades later, de Finetti’s predictive philosophy found new life in Bayesian nonparametric (BNP) methods. In 1973, Blackwell and MacQueen (1973) introduced the Pólya urn scheme as a constructive predictive rule for the Dirichlet Process (DP) prior, which Ferguson had proposed that same year as a nonparametric prior on distributions. Blackwell and MacQueen showed that if \(X_1\sim F_0\) (a base distribution) and for each \(n\ge1\), \(X_{n+1}\mid X_{1:n} \sim \frac{\alpha}{\alpha+n}F_0 + \frac{1}{\alpha+n}\sum_{i=1}^n \delta_{X_i}\,,\) then the sequence \((X_n)_{n\ge1}\) is exchangeable, and its de Finetti mixing measure is exactly the Dirichlet process with concentration \(\alpha\) and base \(F_0\). In this predictive rule, with probability \(\frac{\alpha}{\alpha+n}\) the \((n+1)\)th draw is a new value sampled from \(F_0\), and with probability \(\frac{n_j}{\alpha+n}\) it repeats one of the previously seen values (specifically, it equals the j-th distinct value seen so far, which occurred \(n_j\) times). This elegant scheme generates clusters of identical values and is the basis of the Chinese restaurant process in machine learning. Importantly, it required no explicit mention of a parameter – the predictive probabilities themselves defined the model. The Dirichlet process became the canonical example of a prior that is constructed via predictive distributions. Around the same time, Cifarelli and Regazzini (1978) in Italy discussed Bayesian nonparametric problems under exchangeability, and Ewens’s sampling formula (Ewens 1972) in population genetics provided another famous predictive rule for random partitions of species. These developments showed the power of de Finetti’s idea: we can build rich new models by directly formulating how observations predict new ones.
1980s – Predictive Inference and Model Assessment: By the 1980s, the predictive viewpoint began influencing statistical practice and philosophy outside of nonparametrics. Seymour Geisser advanced the idea that predictive ability is the ultimate test of a model – he promoted predictive model checking and advocated using the posterior predictive distribution for model assessment and selection (foundational to modern cross-validation approaches). In 1981, Rubin introduced the Bayesian bootstrap, an alternative to the classical bootstrap, which can be seen as a predictive inferential method: it effectively assumes an exchangeable model where the “prior” is that the \(n\) observed data points are a finite population from which future samples are drawn uniformly at random. The Bayesian bootstrap’s posterior predictive for a new observation is simply the empirical distribution of the observed sample (with random weights), which aligns with de Finetti’s view of directly assigning probabilities to future data without a parametric likelihood. Ghosh and Meeden (Ghosh and Meeden 1986; Ghosh 2021) further developed Bayesian predictive methods for finite population sampling, treating the unknown finite population values as exchangeable and focusing on predicting the unseen units – again, no explicit parametric likelihood was needed. These works kept alive the notion that Bayesian inference “a la de Finetti” – with predictions first – could be practically useful. However, at the time, mainstream Bayesian statistics still largely centred on parametric models and priors, so the predictive approach was a somewhat heterodox perspective, championed by a subset of Bayesian thinkers.
1990s – The Italian School and Generalized Exchangeability: The 1990s saw renewed theoretical interest in characterizing exchangeable structures via predictions. Partial exchangeability (where data have subgroup invariances, like Markov exchangeability or other structured dependence) became a focus. In 1995, Jim Pitman generalized the Pólya urn to a two-parameter family (the Pitman–Yor process), broadening the class of predictive rules to capture power-law behavior in frequencies (Pitman 1995). In Italy, scholars like Eugenio Regazzini, Pietro Muliere, and their collaborators began exploring reinforced urn processes and other predictive constructions for more complex sequences. For example, Pietro Muliere and Petrone (1993) applied predictive mixtures of Dirichlet processes in regression problems, and P. Muliere, Secchi, and Walker (2000) introduced reinforced urn models for survival data. These models were essentially Markov chains whose transition probabilities update with reinforcement (i.e. past observations feed back into future transition probabilities), and they showed such sequences are mixtures of Markov chains – a type of partially exchangeable structure. Throughout, the strategy was to start by positing a plausible form for the one-step predictive distribution and then deduce the existence and form of the underlying probability law or “prior.” This reversed the conventional approach: instead of specifying a prior then deriving predictions, we specify predictions and thereby defined an implicit prior. By the end of the 90s, the groundwork was laid for a systematic predictive construction of Bayesian models.
2000s – Predictive Characterizations and New Priors: In 2000, a landmark paper (Fortini, Ladelli, and Regazzini 2000) formalized the conditions for a predictive rule to yield exchangeability. They gave precise necessary and sufficient conditions on a sequence of conditional distributions \((P_n)\) such that there exists some exchangeable joint law \(P\) producing them. In essence, they proved that symmetry (the predictive probabilities depend on data only through symmetric functions like counts) and a certain consistency (related to associative conditioning of future predictions) characterize exchangeability. This result (along with earlier work by Diaconis and Freedman on sufficiency) provided a rigorous predictive criterion: we can validate if a proposed prediction rule is coherent (comes from some exchangeable model) without explicitly constructing the latent parameter. Around the same time, new priors in BNP were being defined via predictive structures. For instance, the species sampling models (Boothby, Pitman, etc.) were recognized as those exchangeable sequences with a general predictive form \(P(X_{n+1}= \text{new} \mid X_{1:n}) = \frac{\alpha + c k}{\alpha + n}\) (for some constants \(c,\alpha\) and \(k\) distinct values so far), which yields various generalizations of the Dirichlet process. The Italian school played a leading role: they worked out how popular nonparametric priors like Dirichlet processes, Pitman–Yor processes, and others can be derived from a sequence of predictive probabilities. Priors by prediction became a theme. Fortini and Petrone (2012) wrote a comprehensive review on predictive construction of priors for both exchangeable and partially exchangeable scenarios. They highlighted theoretical connections and revisited classical results “to shed light on theoretical connections” among predictive constructions. By the end of the 2000s, it was clear that we could either start with a prior or directly with a predictive mechanism – the two routes were provably equivalent if done consistently, but the predictive route often yielded new insights.
2010s – Consolidation and Wider Adoption: In the 2010s, the predictive approach gained broader recognition and was increasingly connected to modern statistical learning. Fortini and Petrone continued to publish a series of works extending the theory: they explored predictive sufficiency (identifying what summary of data preserves all information for predicting new data), and they characterized a range of complex priors via predictive rules (from hierarchical priors to hidden Markov models built on predictive constructions). For example, they showed how an infinite Hidden Markov Model (used in machine learning for clustering time series) can be seen as a mixture of Markov chains, constructed by a sequence of predictive transition distributions. Meanwhile, machine learning researchers, notably in the topic modelling and Bayesian nonparametric clustering communities, adopted the language of exchangeable partitions (the Chinese restaurant process, Indian buffet process, etc., all essentially predictive rules). The review article Fortini and Petrone (2016) distilled the philosophy and noted how the predictive approach had become central both to Bayesian foundations and to practical modelling in nonparametrics and ML. Another development was the exploration of conditionally identically distributed (CID) sequences (weaker than full exchangeability) and other relaxations – these allow some trend or covariate effects while retaining a predictive structure. Researchers like Berti contributed here, defining models where only a subset of predictive probabilities are constrained by symmetry (Berti, Pratelli, and Rigo 2004, 2012). All these efforts reinforced that de Finetti’s perspective is not just philosophical musing – it leads to concrete new models and methods.
2020s – Martingale Posteriors and Prior-Free Bayesianism: Very recent years have witnessed a surge of interest in prior-free or prediction-driven Bayesian updating rules. Two parallel lines of work – one by Fong, Holmes, and Walker in the UK, and another by Berti, Rigo, and collaborators in Italy – have pushed the predictive approach to its logical extreme: conduct Bayesian inference entirely through predictive distributions, with no explicit prior at all. Edwin Fong’s D.Phil. thesis (C. H. E. Fong 2021) and subsequent papers introduced the Martingale Posterior framework. The core idea is to view the “parameter” as the infinite sequence of future (or missing) observations. If we had the entire population or the entire infinite sequence \(Y_{n+1:\infty}\), any parameter of interest (like the true mean, or the underlying distribution \(F\)) would be known exactly. Thus uncertainty about \(\theta\) is really uncertainty about the as-yet-unseen data. Fong et al. formalize this by directly assigning a joint predictive distribution for all future observations given the observed \(Y_{1:n}\). In notation, instead of a posterior \(p(\theta\mid Y_{1:n})\), they consider \(p(Y_{n+1:\infty}\mid Y_{1:n})\). This is a huge distribution (over an infinite sequence), but under exchangeability it encodes the same information as a posterior on \(\theta\). In fact, there is a one-to-one correspondence: if we choose the predictive distribution in the standard Bayesian way (by integrating the likelihood against a prior), then Doob’s martingale theorem implies the induced distribution on \(\theta\) is exactly the usual posterior. Fong and colleagues instead relax this: they allow the user to specify any predictive mechanism (any sequence of one-step-ahead predictive densities) that seems reasonable for the problem, not necessarily derived from a likelihood-prior pair. As long as these predictive densities are coherent (a martingale in the sense of not contradicting themselves over time), we can define an implicit “posterior” for \(\theta\) or for any function of the unseen data. They dub this the martingale posterior distribution, which “returns Bayesian uncertainty directly on any statistic of interest without the need for the likelihood and prior”. In practice, they introduce an algorithm called predictive resampling to draw samples from the martingale posterior. Essentially, we iteratively sample pseudo-future observations from the chosen predictive rule to impute an entire fake “completion” of the data, use that to compute the statistic of interest, and repeat – thereby approximating the distribution of that statistic under the assumed predictive model. Martingale posteriors generalize Bayesian inference, subsuming standard posteriors when the predictive comes from a usual model, but also allowing robust or model-misspecified settings to be handled by choosing an alternative predictive (e.g. we might choose a heavy-tailed predictive density to guard against outliers, implicitly yielding a different “posterior”).
In parallel, Berti et al. (2023) developed a similar idea of Bayesian predictive inference without a prior. They work axiomatically with a user-specified sequence of predictives \(\{\sigma_n(\cdot\mid x_{1:n})\}_{n\ge0}\) and establish general results for consistency and asymptotics of the resulting inference. One main advantage, as they note, is “no prior probability has to be selected” – the only inputs are the data and the predictive rule. These cutting-edge developments show how de Finetti’s viewpoint – once considered philosophically radical – is now driving methodological innovation for large-scale and robust Bayesian analysis. Today, the predictive approach is not only a cornerstone of Bayesian foundations but also an active area of research in its own right, influencing topics from machine learning (e.g. sequence modelling and meta-learning) to the theory of Bayesian asymptotics.
9 Questions
- How do we extend this idea to causal inference, especially causal abstraction? Do we simply end up reinventing classical graphical models anyway?