I don’t know much about this variant of Bayes, but the central idea is that we consider Bayes updating as a coherent betting rule and back everything else out from that. This gets us something like classic Bayes but with an even more austere approach to what probability is.

I am interested in this because, following an insight of Susan Wei’s, I note that it might be an interesting way of understanding when foundation models do optimal inference, since most neural networks are best understood as purely predictive models anyway.


Bayesian inference is traditionally introduced via unknown parameters: we place a prior on a parameter $\theta$ and update it to a posterior given data, then use the posterior over these parameters to generate a posterior over actual observables. The predictive approach puts prediction at the center: probabilities are assigned directly to future observables rather than to parameters. This view was championed by Bruno de Finetti, who argued that probability statements should refer only to observable events, with parameters serving as a convenient fiction linking past data to future outcomes. In other words, the only meaningful uncertainty is about things we might observe. The parameter (if it exists at all) is an auxiliary object.

We might care about this for philosophical or methodological reasons, or we might care about it because the great success of the age in machine learning (e.g. in neural nets) has been predicting observables rather than trying to tie these predictions to “true parameters”. So maybe we need to think about Bayes in that context too?

1 Questions

2 Background and Notation

Throughout we need to distinguish distributions (uppercase letters) from their densities (lowercase letters) where both exist. We work on an infinite sequence of observations

$$X_1, X_2, \dots$$

taking values in some space $\mathcal{X}$. A finite batch of data is denoted

$$x_{1:n} = (x_1, \dots, x_n).$$

2.1 Classic (Parameter-Based) Bayesian Inference

  • Parameter: $\theta \in \Theta$.

  • Prior distribution on $\theta$: $\Pi(d\theta)$, with density $\pi(\theta)$.

  • Likelihood of data $x_{1:n}$ under parameter $\theta$:

    $L(x_{1:n} \mid \theta)$ (often written $p(x_{1:n} \mid \theta)$).

  • Posterior distribution of $\theta$:

    $$\Pi(d\theta \mid x_{1:n}), \qquad \text{with density} \qquad \pi(\theta \mid x_{1:n}) = \frac{L(x_{1:n} \mid \theta)\,\pi(\theta)}{\int_\Theta L(x_{1:n} \mid \vartheta)\,\pi(\vartheta)\,d\vartheta}.$$

  • Posterior predictive for a new observation $X_{n+1}$:

    $$P_{\mathrm{param}}(X_{n+1} \in A \mid x_{1:n}) = \int_\Theta P_\theta(X_{n+1} \in A)\,\Pi(d\theta \mid x_{1:n}),$$

    or in density form

    $$p(x_{n+1} \mid x_{1:n}) = \int_\Theta p(x_{n+1} \mid \theta)\,\pi(\theta \mid x_{1:n})\,d\theta.$$
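To make the parameter-based recipe concrete, here is a minimal Monte Carlo sketch of the posterior predictive (my own illustrative example, using a conjugate Beta–Bernoulli model so a closed form is available for comparison): draw $\theta$ from the posterior, then average the likelihood of the next observation over those draws.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data and prior: coin flips with a Beta(a, b) prior on theta.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
a, b = 1.0, 1.0
k, n = int(x.sum()), len(x)

# Conjugacy: the posterior is Beta(a + k, b + n - k).
theta_draws = rng.beta(a + k, b + n - k, size=200_000)

# Posterior predictive P(X_{n+1} = 1 | x_{1:n}) = E[theta | x_{1:n}],
# approximated by averaging the likelihood over posterior draws.
print("Monte Carlo predictive:", theta_draws.mean())
print("closed form (a + k)/(a + b + n):", (a + k) / (a + b + n))
```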

3 Predictive (de Finetti–Style) Inference

Rather than introduce an unobserved θ, we work directly with the sequence of one-step-ahead predictive distributions:

$$P_n(\cdot \mid x_{1:n}) \equiv \Pr\{X_{n+1} \in \cdot \mid X_{1:n} = x_{1:n}\}, \qquad n = 0, 1, 2, \dots,$$

where:

  • $P_0$ is the prior predictive (our belief about $X_1$ before seeing data).
  • $P_n$ is the predictive rule after seeing $x_{1:n}$.

If these $P_n$ satisfy coherence and exchangeability conditions (resp. Kolmogorov consistency, symmetry in $x_{1:n}$), then by de Finetti's theorem there exists an (implicit) random measure $F$ so that

$$P_n(A \mid x_{1:n}) = \mathbb{E}\left[F(A) \mid x_{1:n}\right].$$

In effect, the $P_n$ are our Bayesian updates, but we never need to write down (or sample) the posterior on $F$.

3.1 Key notation for predictives

| Symbol | Meaning |
|---|---|
| $P_n(\cdot \mid x_{1:n})$ | Predictive distribution for $X_{n+1}$ given data $x_{1:n}$. |
| $p_n(x_{n+1} \mid x_{1:n})$ | Density of $P_n$ when it exists (lowercase for densities). |
| $\tilde F$ | Implicit random probability measure (“parameter” in de Finetti’s sense). |
| $\Pi(dF)$ | Mixing measure on $\tilde F$ (the de Finetti prior), usually never written out. |
| $\alpha, F_0$ | Hyperparameters in Dirichlet-process examples: $\alpha$ is concentration, $F_0$ the base. |

  • The parameter view is familiar to more people: we write down $\pi(\theta)$ and $p(x \mid \theta)$, update to $\pi(\theta \mid x)$, then integrate out $\theta$ to predict.
  • The predictive view flips that: we directly specify how we would predict the next datum at each step. If our predictive rule is coherent, there automatically exists some Bayes model behind it.

For many modern nonparametric and robust methods, specifying the $P_n$ directly can be easier than choosing a high-dimensional prior. In some cases (e.g. Fortini–Petrone's predictive resampling or Fong et al.'s martingale posteriors) we never explicitly deal with $\tilde F$ or $\theta$ at all.

4 Practical implementation

You will notice that most of the theory below consists of elaborate and quite painful proofs of symmetry properties for exponential-family models under interesting but not exactly state-of-the-art world models. You might ask: can we actually compute predictive Bayes in practice? Would this be helpful for building intuition?

Yes, it turns out there are a couple of useful models. In particular, Bayesian in-context learning via transformers and the like seems to be a pretty good model for a pure Bayes predictive. Hollmann et al. (), a tabular data inference method, and Lee et al. (), a neural process regressor, both make this connection explicit and are great to play with. I find this massively helpful for building intuition, as opposed to, say, starting by proving theorems about factorisations of measures under exchangeability.

5 How it works

I’m still in the process of arguing with the literature about this section. What follows is a mixture of my content and LLM content, and it is not yet vetted for sanity.

5.1 Exchangeability and de Finetti’s Theorem

Exchangeability is the assumption that our probabilistic beliefs about a sequence of random variables do not depend on the order in which data are observed. Formally, $(X_1, \dots, X_n)$ is exchangeable if for every permutation $\pi$ of $1, \dots, n$, the joint distribution satisfies $P(X_1 \in dx_1, \dots, X_n \in dx_n) = P(X_{\pi(1)} \in dx_1, \dots, X_{\pi(n)} \in dx_n)$. De Finetti's representation theorem is the foundation: any infinite exchangeable sequence is a conditionally iid sequence. There exists a (generally unknown) random distribution $F$ such that, given $F$, the $X_i$ are iid with law $F$. In Bayesian terms, $F$ plays the role of a parameter; in fact we can think of $F$ as $\theta$ and put a prior on it. In measure-theoretic form: if $(X_n)_{n \ge 1}$ is exchangeable on a space $\mathcal{X}$, then there exists a probability measure $\mu$ on the set of probability distributions over $\mathcal{X}$ such that for every $n$ and any measurable events $A_1, \dots, A_n \subseteq \mathcal{X}$,

$$P(X_1 \in A_1, \dots, X_n \in A_n) = \int \prod_{i=1}^{n} F(A_i)\, \mu(dF).$$

$\mu$ is the de Finetti mixing measure, or the law of the random $F$. If $\mu$ were known, we could sample $F \sim \mu$ (this is like drawing a random parameter) and then sample $X_1, \dots, X_n \overset{\text{iid}}{\sim} F$. Of course $\mu$ is typically not known; it is merely asserted to exist by the theorem.

  1. The joint distribution factors into a product of predictive terms. As noted earlier, $P(x_1, \dots, x_n) = \int \prod_{i=1}^{n} F(dx_i)\, \mu(dF)$. By Fubini's theorem, we can swap integration order to see this as iterative conditioning: first draw $F \sim \mu$, then $X_1 \sim F$; next draw $X_2 \sim F$, etc. The result is exactly $P(x_1)\, P(x_2 \mid x_1) \cdots P(x_n \mid x_{1:n-1})$. In fact, the one-step predictive distribution emerges naturally: $$P(X_{n+1} \in A \mid X_{1:n} = x_{1:n}) = \int F(A)\, \mu(dF \mid X_{1:n} = x_{1:n}),$$ where $\mu(\cdot \mid X_{1:n} = x_{1:n})$ is the posterior distribution of $F$ given the observed data. This formula formalizes a simple idea: the predictive distribution for a new observation is the posterior mean of $F$ (the “urn” distribution). For example, in a coin-flip scenario, $F$ would be the true (but unknown) distribution of Heads/Tails; given some data, the predictive probability of Heads on the next flip is the posterior expected value of $F(\text{Heads})$. If we had a $\mathrm{Beta}(a, b)$ prior on the coin's bias $p = F(\text{Heads})$, this predictive is $\frac{a + (\text{heads so far})}{a + b + n}$, the well-known Beta–Binomial rule. Taking $a = b = 1$ (a uniform prior), this gives $(1 + k)/(2 + n)$ after $k$ heads in $n$ tosses, which is Laplace's “rule of succession.” Notice the predictive viewpoint reproduces such formulas in this case: the predictive is literally the posterior expectation of the unknown frequency.

  2. Learning is reflected in the convergence of predictive distributions. Under exchangeability, the strong law of large numbers implies the empirical distribution of observations converges to the underlying $F$ almost surely. In predictive terms, this means our predictive distribution eventually concentrates. A precise statement: with probability 1, $P(X_{n+1} \in \cdot \mid X_{1:n})$ converges (as $n \to \infty$) to the limiting empirical distribution $F$. In other words, as we see more data, the sequence of predictive measures $(P_n)$ forms a martingale that converges to the true distribution $F$ almost surely. This is a powerful (main?) consistency property: no matter what the true $F$ is, an exchangeable Bayesian will (almost surely) eventually have predictive probabilities that match $F$ on every set $A$. In classical terms, the posterior on $F$ converges to a point mass at the truth (if the model is well specified). This was first shown by Joseph Doob in 1949 using martingale theory, and it provides a frequentist validation of Bayesian learning in the exchangeable case. From the predictive perspective, it says that if our predictive rule is coherent and we eventually see enough data, we will effectively discover the data-generating distribution: our forecasts become indistinguishable from frequencies in the long run. (A small simulation illustrating both points follows this list.)
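A quick simulation (my own illustrative setup: a coin with true heads probability 0.3 and a uniform Beta(1,1) prior) shows both points at once: the predictive is Laplace's rule $(1+k)/(2+n)$, and as $n$ grows it converges to the true frequency.

```python
import numpy as np

rng = np.random.default_rng(1)

true_p = 0.3                      # the "unknown" F(Heads); illustrative value
flips = rng.binomial(1, true_p, size=5000)

# Laplace's rule of succession: P(Heads | first n flips) = (1 + k) / (2 + n).
heads = np.cumsum(flips)
n = np.arange(1, len(flips) + 1)
predictive = (1 + heads) / (2 + n)

for m in (10, 100, 1000, 5000):
    print(f"n={m:5d}  predictive={predictive[m - 1]:.4f}  empirical={heads[m - 1] / m:.4f}")
# Both columns approach true_p: the sequence of predictives (a martingale)
# concentrates on the limiting frequency, as Doob's argument promises.
```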

As a time-series guy, one thing I found weird about this literature, which I will flag for you, dear reader, is that essentially everything here looks like it should extend to predicting arbitrary phenomena, but different schools introduce different assumptions on the generating process, and it is not easy to back out which, at least not for me. Sometimes we are handling complicated dependencies, other times not. AFAICT 'predictive Bayes' as such doesn't impose much, but de Finetti's actual exchangeability results essentially impose conditionally i.i.d. structure on the observations.

5.2 Predictive Distributions and Coherence Conditions

A predictive rule or predictive distribution sequence is a collection $\{P_n(\cdot \mid x_{1:n}) : n \ge 0,\; x_{1:n} \in \mathcal{X}^n\}$ where each $P_n(\cdot \mid x_{1:n})$ is a probability measure for the next observation $X_{n+1}$ given past observations $x_{1:n}$. (For $n = 0$, we have $P_0(\cdot)$, which is just the distribution of $X_1$ before seeing any data.) Not every arbitrary choice of predictive rules corresponds to a valid joint distribution — they must satisfy coherence constraints. The fundamental coherence condition is that there exists some joint probability law $P$ on the sequence such that for all $n$ and events $A \subseteq \mathcal{X}$:

$$P_n(A \mid X_{1:n} = x_{1:n}) = P(X_{n+1} \in A \mid X_{1:n} = x_{1:n}),$$

i.e. the kernel $P_n$ truly comes from that joint law. By the Ionescu–Tulcea extension theorem, if we can specify a family of conditional probabilities in a consistent way for every finite stage, a unique process measure $P$ on the sequence (on $\mathcal{X}^\infty$) is induced. Thus, one route to defining a Bayesian model is: pick a predictive rule and invoke Ionescu–Tulcea to get the joint. However, ensuring consistency and exchangeability in the predictive rule is awkward; we have a bunch of existence proofs but they are not obviously constructive. Fortini, Ladelli, and Regazzini () gave necessary and sufficient conditions for exchangeability in terms of predictives. In plain language, their conditions are:

  • Symmetry: If we condition on any past data $x_{1:n}$, the predictive distribution $P_n(\cdot \mid x_{1:n})$ must be a symmetric function of $x_{1:n}$. That is, it depends on the past observations only through their multiset or empirical frequency. This is intuitively clear: if the order of past data doesn't matter for joint probabilities, it shouldn't matter for predicting the next observation either. So, for example, in an exchangeable coin-toss model, $P(X_{n+1} = \text{Heads} \mid X_{1:n})$ can only depend on the number of heads in $n$ tosses (and $n$), not on the exact sequence.

  • Associativity / Consistency: This one is more technical. Roughly it means the predictive rule must be internally consistent when extended to two steps. Formally, for any $n \ge 0$ and any events $A, B$ in the state space, we require

    $$\int_A P_{n+1}(B \mid X_{1:n+1} = (x_{1:n}, x_{n+1}))\, P_n(dx_{n+1} \mid x_{1:n}) = \int_B P_{n+1}(A \mid X_{1:n+1} = (x_{1:n}, x_{n+1}))\, P_n(dx_{n+1} \mid x_{1:n}),$$

    and this should hold for all past $x_{1:n}$. Although this expression looks complicated, it is basically saying: if we consider two future time points $n+1$ and $n+2$, the joint predictive for $(X_{n+1}, X_{n+2})$ should be symmetric in those two (because the entire sequence is exchangeable). It ensures that our one-step predictive extends consistently to a two-step (and hence multi-step) predictive. In intuitive terms, condition (ii) is related to Kolmogorov consistency for the projective family of predictive distributions and the requirement of exchangeability on future samples. If symmetry (i) holds and this consistency (ii) holds, then there exists an exchangeable joint law producing those predictives. These conditions are the predictive analogue of de Finetti's theorem: they characterize when a given predictive specification is “valid” (arises from some mixture model). (A numerical check of both conditions for a simple binary predictive rule follows this list.)
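Both conditions can be checked mechanically for a concrete rule. The sketch below (my own construction) does so for the Beta(1,1)/Laplace predictive on $\{0, 1\}$, taking $A = \{1\}$ and $B = \{0\}$ in the consistency identity.

```python
from itertools import product

def p_one(past):
    """Laplace's rule: P(X_next = 1 | past) = (1 + #ones) / (2 + n)."""
    return (1 + sum(past)) / (2 + len(past))

# (i) Symmetry: the predictive depends on the past only through its counts.
assert p_one((1, 0, 0, 1)) == p_one((0, 1, 1, 0))

# (ii) Two-step consistency with A = {1}, B = {0}:
#      P(X_{n+1}=1, X_{n+2}=0 | past) must equal P(X_{n+1}=0, X_{n+2}=1 | past).
for past in product([0, 1], repeat=4):
    lhs = p_one(past) * (1 - p_one(past + (1,)))      # 1 then 0
    rhs = (1 - p_one(past)) * p_one(past + (0,))      # 0 then 1
    assert abs(lhs - rhs) < 1e-12

print("Laplace's rule is symmetric and two-step consistent.")
```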

The takeaway is that to design a Bayesian model in predictive form, we must propose a rule $P_n(\cdot \mid x_{1:n})$ that is symmetric and coherent across time. If we do have such a rule, whether from classic parametric modelling or from some spicy extension, a de Finetti-type parameter $\Theta$ (the random distribution $F$) is implicitly defined as the thing whose posterior predictive matches our rule.

Note that we get this for free by just doing classic-flavoured parametric Bayes and assuming that our likelihood is correct. The extra work here is to make that more general.

The predictive approach often yields an implicit description of the prior on Θ without having to write it down explicitly. A classic example:

Pólya’s Urn/Dirichlet Process.

Suppose we want an exchangeable model for draws $X_i$ taking values in some space (say $\mathbb{R}$ or a discrete set) such that: (a) the first draw $X_1$ has some distribution $F_0$ (a “baseline” measure), and (b) at each step, the probability of seeing a value that has never been seen before is proportional to a constant $\alpha$, while the probability of seeing a value already seen is proportional to how many times it was seen. Formally, we set:

  • $P_0(\cdot) = F_0(\cdot)$,
  • For $n \ge 1$, given past data $x_{1:n}$ with $k$ distinct values $y_1, \dots, y_k$ appearing with counts $n_1, \dots, n_k$, define $$P_n(X_{n+1} = y_j \mid x_{1:n}) = \frac{n_j}{\alpha + n}, \quad j = 1, \dots, k, \qquad\text{and}\qquad P_n(X_{n+1} \text{ is a new value} \mid x_{1:n}) = \frac{\alpha}{\alpha + n}.$$ If a new value is to appear, it is drawn from the base distribution $F_0$ (e.g. a new color in an urn is sampled from $F_0$).

It is easy to check symmetry: the formulas depend on $(n_1, \dots, n_k)$, the counts of past values, which is a symmetric function of the data. The consistency condition can also be verified (and is a special case of the general theorem above); essentially it holds because of the “reinforcement” form of the process (adding two observations in either order leads to the same updated counts). Therefore, by the theorem, there is an exchangeable joint law corresponding to this predictive rule. Blackwell and MacQueen () proved that this law's de Finetti measure is the Dirichlet process $\mathrm{DP}(\alpha F_0)$. In other words, specifying this predictive scheme is equivalent to assuming $F \sim \mathrm{DP}(\alpha F_0)$ as a prior on the distribution and then observing iid samples. But notice how much simpler the predictive description is: we can understand and simulate the model just by following the above recipe, without any direct invocation of a Dirichlet or Beta function. This predictive view also makes some properties obvious: e.g. the probability that the next observation is new remains $\alpha/(\alpha + n)$, which tends to 0 as $n$ grows, implying eventually all draws are repeats of existing values; hence the limiting random distribution $F$ is discrete (almost surely, it is an atomic distribution with infinitely many potential atoms). All that can be “seen” directly from the predictive rule, whereas deriving it from the Dirichlet process definition might require more work. This example underscores how we can “build priors by thinking predictively.” Many other BNP priors have similar stories; for instance, the Pitman–Yor process has a two-parameter predictive rule with $P(\text{new}) \propto \alpha + ck$ and $P(X_{n+1} = y_j) \propto n_j - c$ for some $0 \le c < 1$ (with some constraints to ensure nonnegativity), yielding a rich-get-richer scheme with power-law behavior in the counts. The underlying de Finetti measure in that case is a stable (or two-parameter Poisson–Dirichlet) process, but we can derive it from first principles via the predictive condition.
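The urn is also trivial to simulate. A minimal sketch (the concentration $\alpha = 2$ and standard-normal base $F_0$ are arbitrary choices of mine) generates an exchangeable sequence directly from the predictive rule, with no Dirichlet process in sight:

```python
import numpy as np

rng = np.random.default_rng(2)

def polya_urn(n, alpha=2.0, base_sampler=rng.standard_normal):
    """Draw X_1, ..., X_n from the Blackwell-MacQueen urn predictive."""
    draws = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            draws.append(float(base_sampler()))   # new value from the base F_0
        else:
            draws.append(draws[rng.integers(i)])  # repeat a past value, w.p. proportional to its count
    return np.array(draws)

x = polya_urn(1000)
print("distinct values among 1000 draws:", len(np.unique(x)))
print("P(next draw is new):", 2.0 / (2.0 + 1000))  # decays to 0, so the limiting F is discrete
```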

5.3 Parametric Models and Sufficient Statistics.

The predictive approach is fully compatible with parametric Bayesian models as well. If we have a parametric family $f_\theta(x)$ and a prior $\pi(d\theta)$, the traditional posterior predictive is $$P(X_{n+1} \in A \mid X_{1:n} = x_{1:n}) = \int_\Theta P_\theta(X_{n+1} \in A)\, \pi(d\theta \mid X_{1:n} = x_{1:n}).$$ This obeys the coherence conditions automatically (since it came from a joint model). But an interesting viewpoint is gained by focusing on predictive sufficiency. In many models, the posterior predictive depends on the data only through some summary $T(x_{1:n})$. For example, in the Beta–Binomial model (coin flips with a Beta prior), $T$ can be the number of heads $k$ in $n$ flips. In a normal model with known variance and unknown mean, $T$ could be the sample mean. In general, if $T(x_{1:n})$ is a sufficient statistic for $\theta$, then the predictive distribution $P_n(\cdot \mid x_{1:n})$ will be a function of $T(x_{1:n})$. Fortini, Ladelli, and Regazzini () discuss how predictive sufficiency relates to classical sufficiency. Essentially, a statistic $T_n = T(X_{1:n})$ is predictively sufficient if $P_n(X_{n+1} \in \cdot \mid X_{1:n}) = P_n(X_{n+1} \in \cdot \mid T_n)$ for all data, that is, the prediction for a new observation depends on past data only through $T_n$. For an exponential family with conjugate prior, this $T_n$ will usually be the same as the conventional sufficient statistic. Predictive sufficiency provides another angle: instead of deriving the posterior for $\theta$, we can directly derive the predictive by updating $T$. For example, in the normal-with-known-variance case, we can derive the Gaussian form of the predictive for the next observation by directly updating the mean and uncertainty of the sampling distribution, rather than explicitly updating a prior for $\theta$. In summary, parametric Bayesian models yield predictive rules of a special form (ones that admit a finite-dimensional sufficient statistic). These are consistent with the general theory but represent cases where the predictive distributions lie in a lower-dimensional family. The predictive approach doesn't conflict with parametric modelling; it generalizes it, and in practice it often yields the same results (since Bayes' theorem ensures coherence). What it offers is a different interpretation of those results: our posterior is just a device to produce predictions, and we could have produced those predictions without ever introducing $\theta$ explicitly.
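A small sketch of that normal-with-known-variance case (the conjugate prior $\mu \sim N(m_0, \tau_0^2)$ and the particular numbers are my own choices): the predictive for the next observation is computed from the sufficient statistic $(n, \sum_i x_i)$ alone, with no posterior sampling of $\theta$.

```python
import numpy as np

def normal_predictive(t_n, m0=0.0, tau0_sq=10.0, sigma_sq=1.0):
    """Predictive N(mean, var) for X_{n+1} from T_n = (n, sum_x) only.

    Conjugate model: X_i | mu ~ N(mu, sigma_sq) with sigma_sq known,
    prior mu ~ N(m0, tau0_sq).
    """
    n, sum_x = t_n
    post_prec = 1.0 / tau0_sq + n / sigma_sq
    post_mean = (m0 / tau0_sq + sum_x / sigma_sq) / post_prec
    # Predictive variance = sampling noise + remaining uncertainty about mu.
    return post_mean, sigma_sq + 1.0 / post_prec

rng = np.random.default_rng(3)
x = rng.normal(2.0, 1.0, size=50)
mean, var = normal_predictive((len(x), x.sum()))
print(f"predictive for X_51: N({mean:.3f}, {var:.3f})")
```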

5.4 Martingales and the Martingale Posterior

One of the modern breakthroughs in predictive inference is recognizing the role of martingales in Bayesian updating. We saw earlier that the sequence of predictive distributions $(P_n)$ under an exchangeable model is a martingale (in the weak sense that $P_n$ is the conditional expectation of $P_{n+1}$ given current information) and it converges to the true distribution $F$ almost surely. E. Fong, Holmes, and Walker () took this insight further: if the conventional Bayes posterior predictive yields a martingale that converges to a degenerate distribution at the true $F$, perhaps we can construct other martingales that converge to something else, representing different uncertainty quantification. The martingale posterior is built on this idea. They start from the premise: “the foundation of Bayesian inference is to assign a distribution on missing observations conditional on what has been observed.” Rather than begin with a prior on $\theta$, they ask: what distribution would a Bayesian assign to all the not-yet-seen data $Y_{n+1}, Y_{n+2}, \dots$ given the seen $Y_{1:n}$? If we follow standard Bayes, the answer is clear: it would be the product of posterior predictives, i.e. $$P(Y_{n+1:\infty} \in dy_{n+1:\infty} \mid Y_{1:n} = y_{1:n}) = \int \prod_{m=n+1}^{\infty} F(dy_m)\; \mu(dF \mid Y_{1:n} = y_{1:n}).$$ This complicated object is basically “$F$ drawn from the posterior, then future $Y$ iid from $F$.” From it, we can derive the usual posterior for any parameter $\theta = T(F)$ as the induced distribution of $\theta$. In fact, Doob's theorem showed that if we go this route, the distribution of $\theta$ defined in this way equals the standard posterior for $\theta$. So nothing new so far; it just recapitulates that the Bayesian posterior is equivalent to the Bayesian predictive for the sequence. But now comes the twist: we are free to choose a different joint predictive for the future that is not derived from the prior-predictive formula. If we do so, we get a different “posterior” for $\theta$ when we invert that relationship. The only requirement is that our chosen joint predictive for the future data, given $Y_{1:n}$, satisfies a martingale condition as $n$ grows (so that as $n \to \infty$ it collapses appropriately). Fong et al. introduce a family of such predictive rules (including some based on copula dependence structures to handle observations that are not iid). The resulting martingale posterior is the probability distribution on the parameter (or on any estimand of interest) that corresponds to these new predictives. It is a true posterior in the sense that it is an updated distribution on the quantity given the data, but it need not come from any prior via Bayes' rule. In practice, we can think of it as a Bayesian-like posterior that uses an alternative updating rule. The martingale posterior is chosen to satisfy certain optimality or robustness properties (for instance, we can design it to minimize some loss on predictions, or to coincide with bootstrap resampling in large samples).

Fong et al. note connections between martingale posteriors and the Bayesian bootstrap and other resampling schemes. The term “martingale” highlights that the sequence of these posteriors (as n increases) is a martingale in the space of probability measures, which ensures coherence over time. This concept is quite new, and ongoing research is examining its properties – for example, how close is a martingale posterior to the ordinary posterior as n becomes large? Early results indicate that if the model is well-specified, certain martingale posteriors are consistent (converge on the true parameter) and even asymptotically normal like a standard posterior. But if the model is misspecified, a cleverly chosen martingale posterior might offer advantages in terms of robustness (since it was not derived from the wrong likelihood). The theory here is deep, combining Bayesian ideas with martingale convergence and requiring new technical tools to verify things like convergence in distribution of these posterior-like objects. From a user’s perspective, however, the appeal is clear: we can get a posterior without specifying a prior or likelihood, by directly coding how we think future data should behave. This significantly lowers the barrier to doing Bayesian-style uncertainty quantification in complex problems – we can bypass the often difficult task of choosing a prior or fully specifying a likelihood and focus on predictions (which may be easier to elicit from experts or justify scientifically).
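Here is a hedged sketch of the predictive-resampling idea (the $\alpha \to 0$ urn predictive, which only recycles observed points, and the mean as the estimand are my own simplifying choices; Fong et al.'s constructions are richer): complete the observed sample with a long stream of draws from the current predictive, updating the predictive as each pseudo-observation arrives, record the statistic of the completed sequence, and repeat.

```python
import numpy as np

rng = np.random.default_rng(4)

def predictive_resample_mean(x_obs, n_forward=2000):
    """One draw of the mean under a simple martingale-posterior-style scheme.

    Uses the alpha -> 0 urn predictive: each pseudo-future observation is a
    uniform draw from the current pool, and is then added to the pool
    (so the predictive is updated before the next draw).
    """
    pool = list(x_obs)
    for _ in range(n_forward):
        pool.append(pool[rng.integers(len(pool))])
    return float(np.mean(pool))

x_obs = rng.normal(1.0, 2.0, size=40)
draws = np.array([predictive_resample_mean(x_obs) for _ in range(500)])
print("posterior-style mean and sd for the population mean:",
      round(draws.mean(), 3), round(draws.std(), 3))
```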

6 Summary of Theoretical Insights

The predictive framework rests on a few key insights proven by the above results:

  • Observables suffice: If we can specify our beliefs about observable sequences (one-step-ahead at a time) in a consistent way, we have done the essential job of modelling. Theorems like de Finetti’s and its predictive extensions guarantee that a parameter-based description exists if we want it, but it’s optional. “Bayes à la de Finetti” means one-step probabilities are the building blocks.

  • Exchangeability is a powerful symmetry: It grants a form of “sufficientness” to empirical frequencies. For instance, in an exchangeable model, the predictive distribution for tomorrow given all past data depends only on the distribution of past data, not on their order. This leads to natural Bayesian consistency (learning from frequencies) and justifies why we often reduce data to summary statistics.

  • Predictive characterization of models: Many complex models (like BNP priors or hierarchical mixture models) can be characterized by their predictive rules. In some cases this yields simpler derivations. For example, it’s easier to verify an urn scheme than to prove a Chinese restaurant process formula from scratch. Predictive characterizations also allow extending models by modifying predictive rules (e.g. creating new priors via new urn schemes, such as adding reinforcement or memory).

  • Philosophical clarity: The predictive view clarifies what the “parameter” really is – usually some aspect of an imaginary infinite population. As Fong et al. put it, “the parameter of interest [is] known precisely given the entire population”. This demystifies θ: it’s not a mystical quantity but simply a function of all unseen data. Thus, debating whether θ “exists” is moot – what exists are data (observed or not yet observed). This philosophy can be very practical: it encourages us to check our models by simulating future data (since the model is the prediction rule), and to judge success by calibration of predictions.

  • Flexibility and robustness: By freeing ourselves from always specifying a likelihood, we can create “posterior-like” updates that may be more robust. For instance, we could specify heavier-tailed predictive densities than a Gaussian model would give, to reduce sensitivity to outliers, and still have a coherent updating scheme that quantifies uncertainty. This is one motivation behind general Bayesian approaches and martingale posteriors.

7 Practical Methods and Examples

Let’s walk through a few concrete examples and methods where the predictive approach is applied. We’ve already discussed the Pólya urn (Dirichlet process) and a simple Beta-Bernoulli model. Here we highlight additional applied scenarios:

  • In-context learning: as we presaged above, there are neural networks that give up on the proofs and just compute interesting Bayes predictive updates. See (; ).

  • Bayesian Bootstrap (): Suppose we have observed data $x_1, \dots, x_n$ which we treat as a sample from some population distribution $F$. The Bayesian bootstrap avoids choosing a parametric likelihood for $F$ and instead puts a uniform prior on the space of all discrete distributions supported on $x_1, \dots, x_n$. In effect, we assume exchangeability and that the true $F$ puts all its mass on the observed points (as would be the case if these $n$ points were literally the entire population of values, but we just don't know their weights). The posterior for $F$ given the data then turns out to be a $\mathrm{Dirichlet}(1, 1, \dots, 1)$ on the point masses at $x_1, \dots, x_n$. Consequently, the posterior predictive for a new observation $X_{n+1}$ is $P(X_{n+1} = x_i \mid X_{1:n} = x_{1:n}) = \frac{1}{n}$ for each $i = 1, \dots, n$. In other words, the next observation is equally likely to be any of the observed values. This is precisely the Bayesian bootstrap's predictive distribution, which is just the empirical distribution of the sample (sometimes with one extra point allowed to be new, but under the flat prior that new point gets zero posterior mass). The Bayesian bootstrap can be viewed as a limiting case of the Dirichlet process predictive rule when $\alpha \to 0$ (or, dually, when I say “I believe all probability mass is already in the observed points”). It's a prime example of how a predictive assumption (the next point is equally likely to be any observed point) leads to an implicit prior (Dirichlet(1, …, 1) on weights) and a posterior (the random weights after a Dirichlet draw). The Bayesian bootstrap is often used to generate approximate posterior samples for parameters like the mean or quantiles without having to assume any specific data distribution. This method has gained popularity in Bayesian data analysis, especially in cases where a parametric model is hard to justify. It is an embodiment of de Finetti's idea: we directly express belief about future draws (they should resemble past draws in distribution), and that is our model. (A minimal sketch of this resampling appears after this list.)

  • Predictive Model Checking and Cross-Validation: In Bayesian model evaluation, a common practice is to use the posterior predictive distribution to check goodness of fit: simulate replicated data $X_i^{\text{rep}}$ from $P(X^{\text{rep}} \mid X_{1:n})$ and compare to the observed $X_i$. Any systematic difference may indicate model misfit. This is fundamentally a predictive approach: rather than testing hypotheses about parameters, we ask “does the model predict new data that look like the data we have?”. It aligns perfectly with the predictive view that the ultimate goal of inference is accurate prediction. In fact, modern Bayesian workflow encourages predictive checks at every step. Additionally, methods like leave-one-out cross-validation (LOO-CV) can be given a Bayesian justification via the predictive approach. The LOO-CV score is essentially the product of $P(X_i \mid X_{-i})$ over all $i$ (the probability of each left-out point under the predictive based on the rest). Selecting models by maximizing this score (or its logarithm) is equivalent to maximizing predictive fit. Some recent research (including by Fong and others) formally connects cross-validation to the Bayesian marginal likelihood and even proposes cumulative cross-validation as a way to score models coherently. The philosophy is: a model is good if it predicts well, not just if it has high posterior probability a priori. By building model assessment on predictive distributions, we ensure the evaluation criteria align with the end use of the model (prediction or forecasting).

  • Coresets and Large-Scale Bayesian Summarization: () A very recent application of predictive thinking is in creating Bayesian coreset algorithms – these aim to compress large datasets into small weighted subsets that yield almost the same posterior inference. Traditionally, coreset construction tries to approximate the log-likelihood of the full data by a weighted log-likelihood of a subset (minimizing a KL divergence). However, this fails for complex non-iid models. Flores () proposed to use a predictive coreset: choose a subset of points such that the posterior predictive distribution of the subset is close to that of the full data. In other words, rather than matching likelihoods, match how well the subset can predict new data like the full set would. This approach explicitly cites the predictive view of inference (; ) as inspiration. The result is an algorithm that works even for models where likelihoods are intractable (because it can operate on predictive draws). This is a cutting-edge example of methodological innovation driven by predictive thinking.

  • Machine Learning and Sequence Modelling: It's worth noting that in machine learning, modern large models (like transformers) are often trained to do next-token prediction on sequences. In some recent conceptual work, researchers have drawn a connection between such pre-trained sequence models and de Finetti's theory. Essentially, a large language model trained on tons of text is implicitly representing a predictive distribution for words given preceding words. If the data (text) were regarded as exchangeable in some blocks, the model is doing a kind of empirical Bayes (using the training corpus as prior experience) to predict new text. Some authors () have even argued that in-context learning by these models is equivalent to Bayesian updating on latent features, “Bayesian inference à la de Finetti”. While these ideas are still speculative, they illustrate how the predictive perspective resonates in ML: the focus is entirely on $P(\text{future tokens} \mid \text{past tokens})$. If we were to build an AI that learns like a Bayesian, it might well do so by honing its predictive distribution through experience, rather than by explicitly maintaining a distribution on parameters. This is essentially what these sequence models do, albeit not in a fully coherent probabilistic way. Applications in (; ).
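Minimal sketch of the Bayesian bootstrap described in the list above (the mean as estimand and the toy data are my arbitrary choices): draw $\mathrm{Dirichlet}(1, \dots, 1)$ weights over the observed points and recompute the weighted statistic.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_t(df=3, size=60)      # toy observed sample; the distribution is arbitrary

# Posterior over F is Dirichlet(1, ..., 1) on the observed points, so each
# posterior draw of the mean is a Dirichlet-weighted average of the data.
weights = rng.dirichlet(np.ones(len(x)), size=4000)
mean_draws = weights @ x

lo, hi = np.quantile(mean_draws, [0.025, 0.975])
print(f"posterior mean of the mean: {mean_draws.mean():.3f}, 95% interval: ({lo:.3f}, {hi:.3f})")
```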

8 History

I got an LLM to summarize the history for me:

9 Historical Timeline of the Predictive Framework

  • 1930s – de Finetti's Foundation: In 1937, Bruno de Finetti published his famous representation theorem for exchangeable sequences, laying the cornerstone of predictive Bayesian inference. An infinite sequence of observations $X_1, X_2, \dots$ is exchangeable if its joint probability is invariant under permutation of indices. De Finetti's theorem states that any infinite exchangeable sequence is equivalent to iid sampling from some latent random probability distribution $F$; for suitable observables (say $X_i \in \mathbb{R}$):

    $$P(X_1 \in dx_1, \dots, X_n \in dx_n) = \int \prod_{i=1}^{n} F(dx_i)\, \mu(dF),$$

    where $\mu$ is a “mixing” measure on distribution functions $F$ (this $\mu$ serves as a prior over $F$ in Bayesian terms). Intuitively, if we believe the $X_i$ are exchangeable, we act as if there is some unknown true distribution $F$ governing them; given $F$, the data are iid. De Finetti emphasized that $F$ itself is an unobservable construct; what matters are the predictive probabilities $P(X_{n+1} \in A \mid X_{1:n} = x_{1:n})$ for future observations. His philosophical stance was that probability is about our belief in future observable events, not about abstract parameters. He often illustrated this with betting and forecasting interpretations, effectively treating inference as an updating of predictive “previsions” (expectations of future quantities). De Finetti's ideas formed the philosophical bedrock of the Italian school of subjective Bayesianism, shifting focus toward prediction.

  • 1950s – Formalization of Exchangeability: Following de Finetti, mathematical statisticians solidified the theoretical underpinnings. Hewitt and Savage (1955) provided a rigorous existence proof for de Finetti's representation via measure-theoretic extension theorems (ensuring a mixing measure $\mu$ exists for any exchangeable law). This period established exchangeability as a fundamental concept in Bayesian theory. Simply put, exchangeability = “iid given some $F$”. This result, sometimes called de Finetti's theorem, became a “cornerstone of modern Bayesian theory”. It means that specifying a prior on the parameter (or on $F$) is mathematically equivalent to specifying a predictive rule for the sequence. In fact, we can recover de Finetti's mixture form by multiplying one-step-ahead predictive probabilities:

    $$P(x_1, \dots, x_n) = P(x_1)\, P(x_2 \mid x_1) \cdots P(x_n \mid x_{1:n-1}),$$

    and for an exchangeable model this product must equal the integral above. This insight – that a joint distribution can be factorized into sequential predictive distributions – is central to the predictive approach.

  • 1970s – Bayesian Nonparametrics and Urn Schemes: Decades later, de Finetti's predictive philosophy found new life in Bayesian nonparametric (BNP) methods. In 1973, Blackwell and MacQueen () introduced the Pólya urn scheme as a constructive predictive rule for the Dirichlet process (DP) prior, which Ferguson had proposed that same year as a nonparametric prior on distributions. Blackwell and MacQueen showed that if $X_1 \sim F_0$ (a base distribution) and, for each $n \ge 1$, $$X_{n+1} \mid X_{1:n} \sim \frac{\alpha}{\alpha + n} F_0 + \frac{1}{\alpha + n} \sum_{i=1}^{n} \delta_{X_i},$$ then the sequence $(X_n)_{n \ge 1}$ is exchangeable, and its de Finetti measure is a Dirichlet process law. In this predictive rule, with probability $\frac{\alpha}{\alpha + n}$ the $(n+1)$th draw is a new value sampled from $F_0$, and with probability $\frac{n_j}{\alpha + n}$ it repeats one of the previously seen values (specifically, it equals the $j$-th distinct value seen so far, which occurred $n_j$ times). This elegant scheme generates clusters of identical values and is the basis of the Chinese restaurant process in machine learning. Importantly, it required no explicit mention of a parameter: the predictive probabilities themselves defined the model. The Dirichlet process became the canonical example of a prior that is constructed via predictive distributions. Around the same time, Cifarelli and Regazzini () in Italy discussed Bayesian nonparametric problems under exchangeability, and Ewens's sampling formula () in population genetics provided another famous predictive rule for random partitions of species. These developments showed the power of de Finetti's idea: we can build rich new models by directly formulating how observations predict new ones.

  • 1980s – Predictive Inference and Model Assessment: By the 1980s, the predictive viewpoint began influencing statistical practice and philosophy outside of nonparametrics. Seymour Geisser advanced the idea that predictive ability is the ultimate test of a model – he promoted predictive model checking and advocated using the posterior predictive distribution for model assessment and selection (foundational to modern cross-validation approaches). In 1981, Rubin introduced the Bayesian bootstrap, an alternative to the classical bootstrap, which can be seen as a predictive inferential method: it effectively assumes an exchangeable model where the “prior” is that the n observed data points are a finite population from which future samples are drawn uniformly at random. The Bayesian bootstrap’s posterior predictive for a new observation is simply the empirical distribution of the observed sample (with random weights), which aligns with de Finetti’s view of directly assigning probabilities to future data without a parametric likelihood. Ghosh and Meeden (; ) further developed Bayesian predictive methods for finite population sampling, treating the unknown finite population values as exchangeable and focusing on predicting the unseen units – again, no explicit parametric likelihood was needed. These works kept alive the notion that Bayesian inference “a la de Finetti” – with predictions first – could be practically useful. However, at the time, mainstream Bayesian statistics still largely centred on parametric models and priors, so the predictive approach was a somewhat heterodox perspective, championed by a subset of Bayesian thinkers.

  • 1990s – The Italian School and Generalized Exchangeability: The 1990s saw renewed theoretical interest in characterizing exchangeable structures via predictions. Partial exchangeability (where data have subgroup invariances, like Markov exchangeability or other structured dependence) became a focus. In 1995, Jim Pitman generalized the Pólya urn to a two-parameter family (the Pitman–Yor process), broadening the class of predictive rules to capture power-law behavior in frequencies (). In Italy, scholars like Eugenio Regazzini, Pietro Muliere, and their collaborators began exploring reinforced urn processes and other predictive constructions for more complex sequences. For example, Pietro Muliere and Petrone () applied predictive mixtures of Dirichlet processes in regression problems, and P. Muliere, Secchi, and Walker () introduced reinforced urn models for survival data. These models were essentially Markov chains whose transition probabilities update with reinforcement (i.e. past observations feed back into future transition probabilities), and they showed such sequences are mixtures of Markov chains – a type of partially exchangeable structure. Throughout, the strategy was to start by positing a plausible form for the one-step predictive distribution and then deduce the existence and form of the underlying probability law or “prior.” This reversed the conventional approach: instead of specifying a prior then deriving predictions, we specify predictions and thereby define an implicit prior. By the end of the 90s, the groundwork was laid for a systematic predictive construction of Bayesian models.

  • 2000s – Predictive Characterizations and New Priors: In 2000, a landmark paper () formalized the conditions for a predictive rule to yield exchangeability. They gave precise necessary and sufficient conditions on a sequence of conditional distributions $(P_n)$ such that there exists some exchangeable joint law $P$ producing them. In essence, they proved that symmetry (the predictive probabilities depend on data only through symmetric functions like counts) and a certain consistency (related to associative conditioning of future predictions) characterize exchangeability. This result (along with earlier work by Diaconis and Freedman on sufficiency) provided a rigorous predictive criterion: we can validate whether a proposed prediction rule is coherent (comes from some exchangeable model) without explicitly constructing the latent parameter. Around the same time, new priors in BNP were being defined via predictive structures. For instance, the species sampling models (Pitman and others) were recognized as those exchangeable sequences with a general predictive form $P(X_{n+1} = \text{new} \mid X_{1:n}) = \frac{\alpha + ck}{\alpha + n}$ (for some constants $c, \alpha$ and $k$ distinct values so far), which yields various generalizations of the Dirichlet process. The Italian school played a leading role: they worked out how popular nonparametric priors like Dirichlet processes, Pitman–Yor processes, and others can be derived from a sequence of predictive probabilities. Priors by prediction became a theme. Fortini and Petrone () wrote a comprehensive review on predictive construction of priors for both exchangeable and partially exchangeable scenarios. They highlighted theoretical connections and revisited classical results “to shed light on theoretical connections” among predictive constructions. By the end of the 2000s, it was clear that we could either start with a prior or directly with a predictive mechanism; the two routes are provably equivalent if done consistently, but the predictive route often yields new insights.

  • 2010s – Consolidation and Wider Adoption: In the 2010s, the predictive approach gained broader recognition and was increasingly connected to modern statistical learning. Fortini and Petrone continued to publish a series of works extending the theory: they explored predictive sufficiency (identifying what summary of data preserves all information for predicting new data), and they characterized a range of complex priors via predictive rules (from hierarchical priors to hidden Markov models built on predictive constructions). For example, they showed how an infinite Hidden Markov Model (used in machine learning for clustering time series) can be seen as a mixture of Markov chains, constructed by a sequence of predictive transition distributions. Meanwhile, machine learning researchers, notably in the topic modelling and Bayesian nonparametric clustering communities, adopted the language of exchangeable partitions (the Chinese restaurant process, Indian buffet process, etc., all essentially predictive rules). The review article Fortini and Petrone () distilled the philosophy and noted how the predictive approach had become central both to Bayesian foundations and to practical modelling in nonparametrics and ML. Another development was the exploration of conditionally identically distributed (CID) sequences (weaker than full exchangeability) and other relaxations – these allow some trend or covariate effects while retaining a predictive structure. Researchers like Berti contributed here, defining models where only a subset of predictive probabilities are constrained by symmetry (, ). All these efforts reinforced that de Finetti’s perspective is not just philosophical musing – it leads to concrete new models and methods.

  • 2020s – Martingale Posteriors and Prior-Free Bayesianism: Very recent years have witnessed a surge of interest in prior-free or prediction-driven Bayesian updating rules. Two parallel lines of work – one by Fong, Holmes, and Walker in the UK, and another by Berti, Rigo, and collaborators in Italy – have pushed the predictive approach to its logical extreme: conduct Bayesian inference entirely through predictive distributions, with no explicit prior at all. Edwin Fong's D.Phil. thesis () and subsequent papers introduced the martingale posterior framework. The core idea is to view the “parameter” as the infinite sequence of future (or missing) observations. If we had the entire population or the entire infinite sequence $Y_{n+1:\infty}$, any parameter of interest (like the true mean, or the underlying distribution $F$) would be known exactly. Thus uncertainty about $\theta$ is really uncertainty about the as-yet-unseen data. Fong et al. formalize this by directly assigning a joint predictive distribution for all future observations given the observed $Y_{1:n}$. In notation, instead of a posterior $p(\theta \mid Y_{1:n})$, they consider $p(Y_{n+1:\infty} \mid Y_{1:n})$. This is a huge distribution (over an infinite sequence), but under exchangeability it encodes the same information as a posterior on $\theta$. In fact, there is a one-to-one correspondence: if we choose the predictive distribution in the standard Bayesian way (by integrating the likelihood against a prior), then Doob's martingale theorem implies the induced distribution on $\theta$ is exactly the usual posterior. Fong and colleagues instead relax this: they allow the user to specify any predictive mechanism (any sequence of one-step-ahead predictive densities) that seems reasonable for the problem, not necessarily derived from a likelihood-prior pair. As long as these predictive densities are coherent (a martingale in the sense of not contradicting themselves over time), we can define an implicit “posterior” for $\theta$ or for any function of the unseen data. They dub this the martingale posterior distribution, which “returns Bayesian uncertainty directly on any statistic of interest without the need for the likelihood and prior”. In practice, they introduce an algorithm called predictive resampling to draw samples from the martingale posterior. Essentially, we iteratively sample pseudo-future observations from the chosen predictive rule to impute an entire fake “completion” of the data, use that to compute the statistic of interest, and repeat, thereby approximating the distribution of that statistic under the assumed predictive model. Martingale posteriors generalize Bayesian inference, subsuming standard posteriors when the predictive comes from a usual model, but also allowing robust or model-misspecified settings to be handled by choosing an alternative predictive (e.g. we might choose a heavy-tailed predictive density to guard against outliers, implicitly yielding a different “posterior”).

    In parallel, Berti et al. () developed a similar idea of Bayesian predictive inference without a prior. They work axiomatically with a user-specified sequence of predictives $(\sigma_n(\cdot \mid x_{1:n}))_{n \ge 0}$ and establish general results for consistency and asymptotics of the resulting inference. One main advantage, as they note, is “no prior probability has to be selected” – the only inputs are the data and the predictive rule. These cutting-edge developments show how de Finetti's viewpoint – once considered philosophically radical – is now driving methodological innovation for large-scale and robust Bayesian analysis. Today, the predictive approach is not only a cornerstone of Bayesian foundations but also an active area of research in its own right, influencing topics from machine learning (e.g. sequence modelling and meta-learning) to the theory of Bayesian asymptotics.

10 Incoming

11 References

Berti, Dreassi, Leisen, et al. 2023. Bayesian Predictive Inference Without a Prior.” Statistica Sinica.
Berti, Pratelli, and Rigo. 2004. Limit Theorems for a Class of Identically Distributed Random Variables.” The Annals of Probability.
———. 2012. Limit Theorems for Empirical Processes Based on Dependent Data.” Electronic Journal of Probability.
———. 2021. A Central Limit Theorem for Predictive Distributions.” Mathematics.
Blackwell, and MacQueen. 1973. Ferguson Distributions Via Polya Urn Schemes.” The Annals of Statistics.
Cifarelli, and Regazzini. 1978. “Nonparametric Statistical Problems Under Partial Exchangeability: The Use of Associative Means.” Annali dell’Istituto di Matematica Finanziaria dell’Università di Torino, Serie.
De Finetti. 1937. La Prévision: Ses Lois Logiques, Ses Sources Subjectives.” In Annales de l’institut Henri Poincaré.
Ewens. 1972. The Sampling Theory of Selectively Neutral Alleles.” Theoretical Population Biology.
Flores. 2025. Predictive Coresets.”
Fong, C. H. E. 2021. The predictive view of Bayesian inference.”
Fong, Edwin, Holmes, and Walker. 2021. Martingale Posterior Distributions.”
Fong, Edwin, and Yiu. 2024. Asymptotics for Parametric Martingale Posteriors.”
Fortini, Ladelli, and Regazzini. 2000. Exchangeability, Predictive Distributions and Parametric Models.” Sankhyā: The Indian Journal of Statistics, Series A (1961-2002).
Fortini, and Petrone. 2012. Predictive Construction of Priors in Bayesian Nonparametrics.” Brazilian Journal of Probability and Statistics.
———. 2016. Predictive distribution (de Finetti’s view).” In Wiley StatsRef: Statistics Reference Online.
———. 2025. Exchangeability, Prediction and Predictive Modeling in Bayesian Statistics.” Statistical Science.
Ghosh. 2021. Bayesian Methods for Finite Population Sampling.
Ghosh, and Meeden. 1986. Empirical Bayes Estimation in Finite Population Sampling.” Journal of the American Statistical Association.
Hewitt, and Savage. 1955. Symmetric Measures on Cartesian Products.” Transactions of the American Mathematical Society.
Hollmann, Müller, Eggensperger, et al. 2023. TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second.”
Lee, Yun, Nam, et al. 2023. Martingale Posterior Neural Processes.”
Meeden, and Ghosh. 1983. Choosing Between Experiments: Applications to Finite Population Sampling.” The Annals of Statistics.
Muliere, Pietro, and Petrone. 1993. A Bayesian Predictive Approach to Sequential Search for an Optimal Dose: Parametric and Nonparametric Models.” Journal of the Italian Statistical Society.
Muliere, P., Secchi, and Walker. 2000. Urn Schemes and Reinforced Random Walks.” Stochastic Processes and Their Applications.
Pitman. 1993. Dependence.” In Probability.
———. 1995. Exchangeable and Partially Exchangeable Random Partitions.” Probability Theory and Related Fields.
Rubin. 1981. The Bayesian Bootstrap.” Annals of Statistics.
Ye, and Namkoong. 2024. Exchangeable Sequence Models Quantify Uncertainty Over Latent Concepts.”