The Predictive Approach to Bayesian Inference
Purely-predictive models, “The Italian school”, martingale posteriors, …
2025-07-10 — 2025-08-15
I don’t know much about this variant of Bayes, but the central idea is that we consider Bayes updating as a coherent betting rule and back everything else out from that. This gets us something like classic Bayes but with an even more austere approach to what probability is.
I am interested in this because, following an insight of Susan Wei’s, I note that it might be an interesting way of understanding when foundation models do optimal inference, since most neural networks are best understood as purely predictive models anyway.
Bayesian inference is traditionally introduced via unknown parameters: we place a prior on a parameter $\theta$, write a likelihood for the data given $\theta$, and update to a posterior, from which predictions follow by integrating $\theta$ out. The predictive approach instead treats the distribution of the next observation, given what we have already seen, as the primary object.
We might care about this for philosophical or methodological reasons, or we might care about it because the great success of the age in machine learning (e.g. in neural nets) has been predicting observables rather than trying to tie these predictions to “true parameters”. So maybe we need to think about Bayes in that context too?
1 Questions
- How do we extend this to causal inference, especially causal abstraction?
2 Background and Notation
Throughout we need to distinguish distributions (uppercase letters) from their densities (lowercase letters) where both exist. We work on an infinite sequence of observations $X_1, X_2, \dots$ taking values in some space $\mathcal{X}$.
2.1 Classic (Parameter-Based) Bayesian Inference
- Parameter: $\theta \in \Theta$.
- Prior distribution on $\theta$: $\Pi(\mathrm{d}\theta)$, with density $\pi(\theta)$.
- Likelihood of data $x_{1:n}$ under parameter $\theta$: $p(x_{1:n} \mid \theta)$.
- Posterior distribution of $\theta$: $\pi(\theta \mid x_{1:n}) \propto p(x_{1:n} \mid \theta)\, \pi(\theta)$.
- Posterior predictive for a new observation $X_{n+1}$: $P(X_{n+1} \in A \mid x_{1:n}) = \int_\Theta P_\theta(A)\, \Pi(\mathrm{d}\theta \mid x_{1:n})$, or in density form $p(x_{n+1} \mid x_{1:n}) = \int_\Theta p(x_{n+1} \mid \theta)\, \pi(\theta \mid x_{1:n})\, \mathrm{d}\theta$.
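To make this notation concrete, here is the standard Beta–Bernoulli conjugate pair (a textbook example, not taken from any of the references below); it reappears later as Laplace's rule of succession:

$$\theta \sim \mathrm{Beta}(\alpha, \beta), \qquad X_i \mid \theta \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(\theta),$$

$$\theta \mid x_{1:n} \sim \mathrm{Beta}(\alpha + k,\ \beta + n - k), \qquad P(X_{n+1} = 1 \mid x_{1:n}) = \frac{\alpha + k}{\alpha + \beta + n},$$

where $k = \sum_{i=1}^{n} x_i$. The last expression, the posterior predictive, is the object the predictive approach takes as primary.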
3 Predictive (de Finetti–Style) Inference
Rather than introduce an unobserved parameter $\theta$, we specify our beliefs directly through the sequence of predictive distributions $P_0(\mathrm{d}x_1), P_1(\mathrm{d}x_2 \mid x_1), P_2(\mathrm{d}x_3 \mid x_{1:2}), \dots$ where:

- $P_0$ is the prior predictive (our belief about $X_1$ before seeing data).
- $P_n(\cdot \mid x_{1:n})$ is the predictive rule after seeing $x_{1:n}$.

If these predictives are coherent (in the sense made precise in Section 5.2), they determine a unique exchangeable joint law for the whole sequence, and hence an implicit prior and likelihood.

In effect, the predictive rule is the model; the parameter, if we want one, can be recovered afterwards.
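As a minimal sketch (my own toy code, not from any of the cited papers): a predictive rule is just a map from the observed history to a distribution over the next observation. Here is the Pólya-urn / Dirichlet-process rule that appears later, written in that form; `alpha` and `base_sampler` are illustrative hyperparameter names.

```python
import random

def polya_urn_predictive(history, alpha, base_sampler, rng=random):
    """Sample X_{n+1} given the observed history under the Polya-urn rule:
    with probability alpha/(alpha + n) draw a fresh value from the base
    distribution G_0, otherwise repeat a uniformly chosen past observation."""
    n = len(history)
    if rng.random() < alpha / (alpha + n):
        return base_sampler()       # new value from the base measure
    return rng.choice(history)      # repeat a past value

# Toy usage: grow an exchangeable sequence one predictive step at a time.
rng = random.Random(0)
seq = []
for _ in range(20):
    seq.append(polya_urn_predictive(seq, alpha=1.0,
                                    base_sampler=rng.random, rng=rng))
print(seq)
```

No prior or likelihood appears anywhere; the rule itself is the model.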
3.1 Key notation for predictives
| Symbol | Meaning |
|---|---|
| $P_n(\cdot \mid x_{1:n})$ | Predictive distribution for $X_{n+1}$ given the first $n$ observations. |
| $p_n(\cdot \mid x_{1:n})$ | Density of $P_n$, where it exists. |
| $F$ | Implicit random probability measure (“parameter” in de Finetti’s sense). |
| $\mu$ | Mixing measure on the space of distributions (the implicit prior over $F$). |
| $\alpha, G_0$ | Hyperparameters in Dirichlet-process examples: concentration $\alpha$ and base measure $G_0$. |
- Parameter view is familiar to more people: we write down a prior $\pi(\theta)$ and a likelihood $p(x \mid \theta)$, update to the posterior $\pi(\theta \mid x_{1:n})$, then integrate $\theta$ out to predict.
- Predictive view flips that: we directly specify how we would predict the next datum at each step. If our predictive rule is coherent, there automatically exists some Bayes model behind it.

For many modern nonparametric and robust methods, we could imagine creating the model this way: specify the predictive rule directly and leave the prior implicit.
4 Practical implementation
You will notice that most of the theory below consists of elaborate and quite painful proofs of symmetry properties for exponential-family models under interesting but not exactly state-of-the-art world models. You might ask: can we actually compute predictive Bayes in practice? Would this be helpful for building intuition?
Yes, it turns out there are a couple of useful models. In particular, Bayesian in-context learning via transformers etc. seems to be a pretty good model for a pure Bayes predictive. Hollmann et al. (2023), a tabular data inference method, and Lee et al. (2023), a neural process regressor, both make this connection explicit and are great to play with. I find this massively helpful in building intuitions, as opposed to, say, starting by proving theorems about factorisations of measures under exchangeability.
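For instance, playing with TabPFN looks roughly like the following. This is a hedged sketch: I am assuming the `tabpfn` package's scikit-learn-style `TabPFNClassifier` with `fit`/`predict_proba`; check the current docs, since the API has shifted between versions.

```python
# Sketch only: assumes the `tabpfn` package exposes a scikit-learn-style
# TabPFNClassifier; the heavy lifting was done at pre-training time.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()            # pre-trained transformer
clf.fit(X_train, y_train)           # "fit" mostly just stores the context set
proba = clf.predict_proba(X_test)   # amortised posterior-predictive probabilities
```

The in-context data play the role of $x_{1:n}$ and the network emits a predictive distribution for the new observation directly.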
5 How it works
5.1 Exchangeability and de Finetti’s Theorem
Exchangeability is the assumption that our probabilistic beliefs about a sequence of random variables do not depend on the order in which data are observed. Formally, for every $n$ and every permutation $\sigma$ of $\{1, \dots, n\}$,

$$(X_1, \dots, X_n) \overset{d}{=} (X_{\sigma(1)}, \dots, X_{\sigma(n)}).$$
The joint distribution factors into a product of predictive terms. As noted earlier, exchangeability means the joint law is a mixture of iid laws,

$$P(X_1 \in A_1, \dots, X_n \in A_n) = \int \prod_{i=1}^{n} F(A_i)\, \mu(\mathrm{d}F).$$

By Fubini’s theorem, we can swap integration order to see this as an iterative conditioning: first draw $X_1 \sim P_0$, then $X_2 \sim P_1(\cdot \mid x_1)$; next draw $X_3 \sim P_2(\cdot \mid x_{1:2})$, etc. The result is exactly the mixture above. In fact, the one-step predictive distribution emerges naturally:

$$P(X_{n+1} \in A \mid x_{1:n}) = \int F(A)\, \mu(\mathrm{d}F \mid x_{1:n}),$$

where $\mu(\cdot \mid x_{1:n})$ is the posterior distribution of $F$ given the observed data. This formula formalizes a simple idea: the predictive distribution for a new observation is the posterior mean of $F$ (the “urn” distribution). For example, in a coin-flip scenario, $F$ would be the true (but unknown) distribution of Heads/Tails; given some data, the predictive probability of Heads on the next flip is the posterior expected value of $F(\text{Heads})$. If we had a $\mathrm{Beta}(\alpha, \beta)$ prior on the coin’s bias $\theta$, this predictive is $(\alpha + k)/(\alpha + \beta + n)$ after $k$ heads in $n$ tosses – the well-known Beta-Binomial rule. Taking $\alpha = \beta = 1$ (a uniform prior), this gives $(k+1)/(n+2)$, which is Laplace’s “rule of succession.” Notice the predictive viewpoint reproduces such formulas in this case: it is literally the posterior expectation of the unknown frequency.

Learning is reflected in the convergence of predictive distributions. Under exchangeability, the Strong Law of Large Numbers implies the empirical distribution of observations converges to the underlying $F$ almost surely. In predictive terms, this means our predictive distribution eventually concentrates. A precise statement is: with probability 1, the posterior $\mu(\cdot \mid X_{1:n})$ converges (as $n \to \infty$) to a degenerate distribution that puts all its mass on the limiting empirical distribution, and correspondingly the predictive $P_n(\cdot \mid X_{1:n})$ converges to $F$ itself. In other words, as we see more data, the sequence of predictive measures forms a martingale that converges to the true distribution almost surely. This is a powerful (main?) consistency property: no matter what the true $F$ is, an exchangeable Bayesian will (almost surely) eventually have predictive probabilities that match $F(A)$ on every set $A$. In classical terms, the posterior on $F$ converges to a point mass at the truth (if the model is well-specified). This was first shown by Joseph Doob in 1949 using martingale theory. It provides a frequentist validation of Bayesian learning in the exchangeable case. From the predictive perspective, it says if our predictive rule is coherent and eventually we see enough data, we will effectively discover the data-generating distribution – our forecasts become indistinguishable from frequencies in the long run.
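A quick simulation makes both points tangible (my own sketch; just the Beta–Bernoulli arithmetic above, nothing from the cited papers): the rule-of-succession predictive $(k+1)/(n+2)$, and its almost-sure convergence to the true frequency.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.3                        # the unknown "urn" frequency
flips = rng.random(10_000) < true_p

# Uniform Beta(1, 1) prior => predictive P(next = Heads) = (k + 1) / (n + 2),
# i.e. Laplace's rule of succession.
k = np.cumsum(flips)
n = np.arange(1, len(flips) + 1)
predictive = (k + 1) / (n + 2)

for m in (10, 100, 1_000, 10_000):
    print(f"after {m:>6} flips: predictive P(Heads) = {predictive[m - 1]:.3f}")
# The printed predictive probabilities drift towards true_p = 0.3.
```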
As a time series guy, one thing I found weird about this literature, which I will flag for you, dear reader, is that essentially everything here looks like it should extend to predicting arbitrary phenomena, but different schools introduce different assumptions upon the generating process, and it is not easy to back out which, at least not for me. Sometimes we are handling complicated dependencies, other times not. AFAICT ‘Predictive Bayes’ as such doesn’t impose much, but de Finetti’s actual results about exchangeability essentially impose (conditionally) i.i.d. structure upon the observations.
5.2 Predictive Distributions and Coherence Conditions
A predictive rule or predictive distribution sequence is a collection $\{P_n(\cdot \mid x_{1:n})\}_{n \ge 0}$, i.e. the kernel $P_n$ assigns, to each possible history $x_{1:n}$, a probability distribution for the next observation $X_{n+1}$. For such a rule to arise from an exchangeable joint law, two conditions must hold:
(i) Symmetry: If we condition on any past data $x_{1:n}$, the predictive distribution $P_n(\cdot \mid x_{1:n})$ must be a symmetric function of $(x_1, \dots, x_n)$. That is, it depends on the past observations only through their multiset or empirical frequency. This is intuitively clear – if the order of past data doesn’t matter for joint probabilities, it shouldn’t matter for predicting the next observation either. So, for example, in an exchangeable coin toss model, $P_n(\text{Heads} \mid x_{1:n})$ can only depend on the number of heads in $n$ tosses (and $n$), not on the exact sequence.

(ii) Associativity / Consistency: This one is more technical. Roughly it means the predictive rule must be internally consistent when extended to two steps. Formally, for any $n \ge 0$ and for any events $A, B$ in the state space, we require

$$\int_A P_{n+1}(B \mid x_{1:n}, y)\, P_n(\mathrm{d}y \mid x_{1:n}) = \int_B P_{n+1}(A \mid x_{1:n}, y)\, P_n(\mathrm{d}y \mid x_{1:n}),$$

and this should hold for all past $x_{1:n}$. Although this expression looks complicated, it is basically saying: if we consider two future time points $n+1$ and $n+2$, the joint predictive for $(X_{n+1}, X_{n+2})$ should be symmetric in those two (because the entire sequence is exchangeable). It ensures that our one-step predictive extends consistently to a two-step (and hence multi-step) predictive. In intuitive terms, condition (ii) is related to Kolmogorov consistency for the projective family of predictive distributions and the requirement of exchangeability on future samples. If symmetry (i) holds and this consistency (ii) holds, then there exists an exchangeable joint law producing those predictives. These are the predictive analog of de Finetti’s theorem: they characterize when a given predictive specification is “valid” (arises from some mixture model).
The takeaway is that to design a Bayesian model in predictive form, we must propose a rule $P_n(\cdot \mid x_{1:n})$ satisfying (i) and (ii); exchangeability of the implied joint law, and hence an implicit prior, then follows.
Note, we get that for free by just doing classic-flavoured parametric Bayes and assuming that our likelihood is correct. The extra work here is to make that more general.
The predictive approach often yields an implicit description of the prior on the unknown distribution (or parameter): the prior is whatever mixing measure the coherent predictive rule induces, even if we never write it down explicitly.
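To build a little intuition for (i) and (ii), here is a small numerical check (my own sketch): the Beta–Bernoulli predictive passes the two-step symmetry test, while a recency-weighted rule, which violates symmetry (i) because order matters, also fails it.

```python
def beta_bernoulli_pred(history, a=1.0, b=1.0):
    """Coherent exchangeable predictive: P(next = 1 | history)."""
    return (a + sum(history)) / (a + b + len(history))

def recency_pred(history, gamma=0.5, a=1.0, b=1.0):
    """Recency-weighted rule: depends on the order of past data."""
    n = len(history)
    w = [gamma ** (n - i - 1) for i in range(n)]   # newest observation gets weight 1
    return (a + sum(wi * xi for wi, xi in zip(w, history))) / (a + b + sum(w))

def two_step_gap(pred, history):
    """Condition (ii) for binary data:
    P(1 then 0 | history) - P(0 then 1 | history) should be zero."""
    p10 = pred(history) * (1 - pred(history + [1]))
    p01 = (1 - pred(history)) * pred(history + [0])
    return p10 - p01

history = [1, 0, 0, 1, 1]
print("Beta-Bernoulli gap:", two_step_gap(beta_bernoulli_pred, history))  # ~0
print("Recency-rule gap:  ", two_step_gap(recency_pred, history))         # nonzero
```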
5.3 Parametric Models and Sufficient Statistics
The predictive approach is fully compatible with parametric Bayesian models as well. If we have a parametric family $\{p(\cdot \mid \theta) : \theta \in \Theta\}$ with a prior $\pi(\theta)$, the induced Bayes predictive $p(x_{n+1} \mid x_{1:n}) = \int p(x_{n+1} \mid \theta)\, \pi(\theta \mid x_{1:n})\, \mathrm{d}\theta$ automatically satisfies the coherence conditions, and for exponential families it depends on the past only through the sufficient statistics (e.g. counts, sums) and $n$. Conversely, predictive rules that depend on the data only through such summaries essentially characterize these parametric models, in the spirit of the sufficiency results of Diaconis and Freedman mentioned below.
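A tiny sketch of that claim (my own toy code; a conjugate Normal model with known variance, with `mu0` and `kappa0` as illustrative hyperparameter names): the posterior predictive touches the data only through $(n, \bar{x})$, so reordering the observations changes nothing.

```python
import numpy as np

def gaussian_predictive(data, mu0=0.0, kappa0=1.0, sigma=1.0):
    """Posterior predictive mean and variance for a Normal(theta, sigma^2)
    likelihood with a conjugate Normal(mu0, sigma^2 / kappa0) prior on theta.
    Note the data enter only through n and the sample mean."""
    x = np.asarray(data)
    n, xbar = len(x), x.mean()
    kappa_n = kappa0 + n
    pred_mean = (kappa0 * mu0 + n * xbar) / kappa_n
    pred_var = sigma**2 * (1.0 + 1.0 / kappa_n)
    return pred_mean, pred_var

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=50)
print(gaussian_predictive(data))
print(gaussian_predictive(rng.permutation(data)))   # identical: order is irrelevant
```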
5.4 Martingales and the Martingale Posterior
One of the modern breakthroughs in predictive inference is recognizing the role of martingales in Bayesian updating. We saw earlier that the sequence of predictive distributions forms a martingale: for any fixed event $A$, $\mathbb{E}\left[P_{n+1}(A \mid X_{1:n+1}) \mid X_{1:n}\right] = P_n(A \mid X_{1:n})$, so today’s prediction is the expected value of tomorrow’s prediction. Doob’s martingale convergence theorem is what delivers the consistency results quoted above, and the same machinery lets us define a posterior purely from a predictive rule: impute the unseen future by iterating the predictive, and read off uncertainty for any statistic of the completed data (the martingale posterior).
Fong et al. note connections between martingale posteriors and the Bayesian bootstrap and other resampling schemes. The term “martingale” highlights that the sequence of these posteriors (as more pseudo-future observations are imputed) forms a martingale, which is what makes the resulting uncertainty quantification internally consistent: on average, our future beliefs agree with our current ones.
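Here is a minimal sketch of predictive resampling in that spirit (my own toy code, not the authors' implementation): use the simplest possible predictive, resampling uniformly from the observed-plus-imputed points (the $\alpha \to 0$ Pólya urn), impute a long pseudo-future, and read off the mean of the completed sequence. The resulting draws approximate the Bayesian bootstrap posterior for the mean.

```python
import numpy as np

def martingale_posterior_mean(x_obs, n_future=1_000, n_draws=500, seed=0):
    """Predictive resampling: repeatedly impute a pseudo-future from the
    predictive rule, compute the statistic (here the mean) of the completed
    sequence, and collect the results as 'posterior' draws."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_draws)
    for b in range(n_draws):
        seq = list(np.asarray(x_obs, dtype=float))
        for _ in range(n_future):
            # alpha -> 0 Polya-urn predictive: the next point is a uniform
            # draw from everything seen (or imputed) so far.
            seq.append(seq[rng.integers(len(seq))])
            # (Any other coherent predictive rule could be swapped in here.)
        draws[b] = np.mean(seq)
    return draws

x = np.random.default_rng(42).normal(1.0, 2.0, size=30)
post = martingale_posterior_mean(x)
print("posterior mean ~", post.mean(),
      "95% interval ~", np.quantile(post, [0.025, 0.975]))
```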
6 Summary of Theoretical Insights
The predictive framework rests on a few key insights proven by the above results:
Observables suffice: If we can specify our beliefs about observable sequences (one-step-ahead at a time) in a consistent way, we have done the essential job of modelling. Theorems like de Finetti’s and its predictive extensions guarantee that a parameter-based description exists if we want it, but it’s optional. “Bayes à la de Finetti” means one-step probabilities are the building blocks.
Exchangeability is a powerful symmetry: It grants a form of “sufficientness” to empirical frequencies. For instance, in an exchangeable model, the predictive distribution for tomorrow given all past data depends only on the distribution of past data, not on their order. This leads to natural Bayesian consistency (learning from frequencies) and justifies why we often reduce data to summary statistics.
Predictive characterization of models: Many complex models (like BNP priors or hierarchical mixture models) can be characterized by their predictive rules. In some cases this yields simpler derivations. For example, it’s easier to verify an urn scheme than to prove a Chinese restaurant process formula from scratch. Predictive characterizations also allow extending models by modifying predictive rules (e.g. creating new priors via new urn schemes, such as adding reinforcement or memory).
Philosophical clarity: The predictive view clarifies what the “parameter” really is – usually some aspect of an imaginary infinite population. As Fong et al. put it, “the parameter of interest [is] known precisely given the entire population”. This demystifies $\theta$: it’s not a mystical quantity but simply a function of all unseen data. Thus, debating whether $\theta$ “exists” is moot – what exists are data (observed or not yet observed). This philosophy can be very practical: it encourages us to check our models by simulating future data (since the model is the prediction rule), and to judge success by calibration of predictions.

Flexibility and robustness: By freeing ourselves from always specifying a likelihood, we can create “posterior-like” updates that may be more robust. For instance, we could specify heavier-tailed predictive densities than a Gaussian model would give, to reduce sensitivity to outliers, and still have a coherent updating scheme that quantifies uncertainty. This is one motivation behind general Bayesian approaches and martingale posteriors.
7 Practical Methods and Examples
Let’s walk through a few concrete examples and methods where the predictive approach is applied. We’ve already discussed the Pólya urn (Dirichlet process) and a simple Beta-Bernoulli model. Here we highlight additional applied scenarios:
In-context learning: as we presaged above, there are neural networks that give up on the proofs and just compute interesting Bayes predictive updates. See (Hollmann et al. 2023; Lee et al. 2023).
Bayesian Bootstrap (Rubin 1981): Suppose we have observed data $x_1, \dots, x_n$, which we treat as a sample from some population distribution $F$. The Bayesian bootstrap avoids choosing a parametric likelihood for $F$ and instead puts a uniform prior on the space of all discrete distributions supported on $\{x_1, \dots, x_n\}$. In effect, we assume exchangeability and that the true $F$ puts all its mass on the observed points (as would be the case if these points were literally the entire population values, but we just don’t know their weights). The posterior for $F$ given the data then turns out to be a Dirichlet$(1, \dots, 1)$ distribution on the weights of the point masses at $x_1, \dots, x_n$. Consequently, the posterior predictive for a new observation is $1/n$ for each $x_i$. In other words, the next observation is equally likely to be any of the observed values. This is precisely the Bayesian bootstrap’s predictive distribution, which is just the empirical distribution of the sample (sometimes with one extra point allowed to be new, but with a flat prior that new point gets zero mass posterior). The Bayesian bootstrap can be viewed as a limiting case of the Dirichlet process predictive rule as the concentration $\alpha \to 0$ (or, dually, when I say “I believe all probability mass is already in the observed points”). It’s a prime example of how a predictive assumption (the next point is equally likely to be any observed point) leads to an implicit prior (Dirichlet(1,…,1) on weights) and a posterior (the random weights after a Dirichlet draw). The Bayesian bootstrap is often used to generate approximate posterior samples for parameters like the mean or quantiles without having to assume any specific data distribution. This method has gained popularity in Bayesian data analysis, especially in cases where a parametric model is hard to justify. It is an embodiment of de Finetti’s idea: we directly express belief about future draws (they should resemble past draws in distribution) and that is our model. A code sketch appears at the end of this section.

Predictive Model Checking and Cross-Validation: In Bayesian model evaluation, a common practice is to use the posterior predictive distribution to check goodness-of-fit: simulate replicated data $x^{\mathrm{rep}}$ from $p(x^{\mathrm{rep}} \mid x_{1:n})$ and compare to the observed $x_{1:n}$. Any systematic difference may indicate model misfit. This is fundamentally a predictive approach: rather than testing hypotheses about parameters, we ask “does the model predict new data that look like the data we have?”. It aligns perfectly with the predictive view that the ultimate goal of inference is accurate prediction. In fact, modern Bayesian workflow encourages predictive checks at every step. Additionally, methods like leave-one-out cross-validation (LOO-CV) can be given a Bayesian justification via the predictive approach. The LOO-CV score is essentially the product of $p(x_i \mid x_{-i})$ over all $i$ (the probability of each left-out point under the predictive based on the rest). Selecting models by maximizing this score (or its logarithm) is equivalent to maximizing predictive fit. Some recent research (including by Fong and others) formally connects cross-validation to Bayesian marginal likelihood and even proposes cumulative cross-validation as a way to score models coherently. The philosophy is: a model is good if it predicts well, not just if it has high posterior probability a priori. By building model assessment on predictive distributions, we ensure the evaluation criteria align with the end-use of the model (prediction or forecasting).

Coresets and Large-Scale Bayesian Summarization (Flores 2025): A very recent application of predictive thinking is in creating Bayesian coreset algorithms – these aim to compress large datasets into small weighted subsets that yield almost the same posterior inference. Traditionally, coreset construction tries to approximate the log-likelihood of the full data by a weighted log-likelihood of a subset (minimizing a KL divergence). However, this fails for complex non-iid models. Flores (2025) proposed to use a predictive coreset: choose a subset of points such that the posterior predictive distribution of the subset is close to that of the full data. In other words, rather than matching likelihoods, match how well the subset can predict new data like the full set would. This approach explicitly cites the predictive view of inference (E. Fong and Yiu 2024; Fortini and Petrone 2012) as inspiration. The result is an algorithm that works even for models where likelihoods are intractable (because it can operate on predictive draws). This is a cutting-edge example of methodological innovation driven by predictive thinking.
Machine Learning and Sequence Modelling: It’s worth noting that in machine learning, modern large models (like transformers) are often trained to do next-token prediction on sequences. In some recent conceptual work, researchers have drawn a connection between such pre-trained sequence models and de Finetti’s theory. Essentially, a large language model that’s been trained on tons of text is implicitly representing a predictive distribution for words given preceding words. If the data (text) were regarded as exchangeable in some blocks, the model is doing a kind of empirical Bayes (using the training corpus as prior experience) to predict new text. Some authors (Ye and Namkoong 2024) have even argued that in-context learning by these models is equivalent to Bayesian updating on latent features, “Bayesian inference à la de Finetti”. While these ideas are still speculative, they illustrate how the predictive perspective resonates in ML: the focus is entirely on $p(x_{n+1} \mid x_{1:n})$, the next-observation predictive. If we were to build an AI that learns like a Bayesian, it might well do so by honing its predictive distribution through experience, rather than by explicitly maintaining a distribution on parameters. This is essentially what these sequence models do, albeit not in a fully coherent probabilistic way. See the applications in (Hollmann et al. 2023; Lee et al. 2023).
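As promised in the Bayesian bootstrap item above, a minimal sketch (the standard Rubin 1981 recipe; my own code): draw Dirichlet$(1, \dots, 1)$ weights over the observed points and recompute the functional of interest under each reweighting.

```python
import numpy as np

def bayesian_bootstrap(x, stat=np.average, n_draws=5_000, seed=0):
    """Bayesian bootstrap: posterior draws for a functional of the unknown
    distribution, assuming all mass sits on the observed points with
    Dirichlet(1, ..., 1) random weights. `stat` must accept a weights= kwarg."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    weights = rng.dirichlet(np.ones(len(x)), size=n_draws)
    return np.array([stat(x, weights=w) for w in weights])

x = np.random.default_rng(7).exponential(scale=2.0, size=40)
post_mean = bayesian_bootstrap(x)   # posterior draws for the population mean
print(post_mean.mean(), np.quantile(post_mean, [0.025, 0.975]))
```

Run on the same data, this should closely match the predictive-resampling sketch in Section 5.4, which is the connection Fong et al. point out.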
8 History
I got an LLM to summarize the history for me:
9 Historical Timeline of the Predictive Framework
1930s – de Finetti’s Foundation: In 1937, Bruno de Finetti published his famous representation theorem for exchangeable sequences, laying the cornerstone of predictive Bayesian inference. An infinite sequence of observations $X_1, X_2, \dots$ is exchangeable if its joint probability is invariant under permutation of indices. De Finetti’s theorem states that any infinite exchangeable sequence is equivalent to iid sampling from some latent random probability distribution $F$; for suitable observables (say events $A_1, \dots, A_n$):

$$P(X_1 \in A_1, \dots, X_n \in A_n) = \int \prod_{i=1}^{n} F(A_i)\, \mu(\mathrm{d}F),$$

where $\mu$ is a “mixing” measure on distribution functions (this serves as a prior over $F$ in Bayesian terms). Intuitively, if we believe the $X_i$ are exchangeable, we act as if there is some unknown true distribution $F$ governing them; given $F$, the data are iid. De Finetti emphasized that $F$ itself is an unobservable construct – what matters are the predictive probabilities for future observations. His philosophical stance was that probability is about our belief in future observable events, not about abstract parameters. He often illustrated this with betting and forecasting interpretations, effectively treating inference as an updating of predictive “previsions” (expectations of future quantities). De Finetti’s ideas formed the philosophical bedrock of the Italian school of subjective Bayesianism, shifting focus toward prediction.

1950s – Formalization of Exchangeability: Following de Finetti, mathematical statisticians solidified the theoretical underpinnings. Hewitt and Savage (1955) provided a rigorous existence proof for de Finetti’s representation via measure-theoretic extension theorems (ensuring a mixing measure $\mu$ exists for any exchangeable law). This period established exchangeability as a fundamental concept in Bayesian theory. Simply put, exchangeability = “iid given some $F$”. This result, sometimes called de Finetti’s theorem, became a “cornerstone of modern Bayesian theory”. It means that specifying a prior on the parameter (or on $F$) is mathematically equivalent to specifying a predictive rule for the sequence. In fact, we can recover de Finetti’s mixture form by multiplying one-step-ahead predictive probabilities:

$$p(x_1, \dots, x_n) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_n \mid x_{1:n-1}),$$

and for an exchangeable model this product must equal the integral above. This insight – that a joint distribution can be factorized into sequential predictive distributions – is central to the predictive approach.
1970s – Bayesian Nonparametrics and Urn Schemes: Decades later, de Finetti’s predictive philosophy found new life in Bayesian nonparametric (BNP) methods. In 1973, Blackwell and MacQueen (1973) introduced the Pólya urn scheme as a constructive predictive rule for the Dirichlet Process (DP) prior, which Ferguson had proposed that same year as a nonparametric prior on distributions. Blackwell and MacQueen showed that if $X_1 \sim G_0$ (a base distribution) and for each $n \ge 1$

$$X_{n+1} \mid X_{1:n} \sim \frac{\alpha}{\alpha + n}\, G_0 + \frac{1}{\alpha + n} \sum_{i=1}^{n} \delta_{X_i},$$

then the sequence is exchangeable, and the urn scheme defines a sequence directed by a Dirichlet process law. In this predictive rule, with probability $\alpha/(\alpha + n)$ the $(n+1)$th draw is a new value sampled from $G_0$, and with probability $n_j/(\alpha + n)$ it repeats one of the previously seen values (specifically, it equals the $j$-th distinct value seen so far, which occurred $n_j$ times). This elegant scheme generates clusters of identical values and is the basis of the Chinese restaurant process in machine learning. Importantly, it required no explicit mention of a parameter – the predictive probabilities themselves defined the model. The Dirichlet process became the canonical example of a prior that is constructed via predictive distributions. Around the same time, Cifarelli and Regazzini (1978) in Italy discussed Bayesian nonparametric problems under exchangeability, and Ewens’s sampling formula (Ewens 1972) in population genetics provided another famous predictive rule for random partitions of species. These developments showed the power of de Finetti’s idea: we can build rich new models by directly formulating how observations predict new ones.

1980s – Predictive Inference and Model Assessment: By the 1980s, the predictive viewpoint began influencing statistical practice and philosophy outside of nonparametrics. Seymour Geisser advanced the idea that predictive ability is the ultimate test of a model – he promoted predictive model checking and advocated using the posterior predictive distribution for model assessment and selection (foundational to modern cross-validation approaches). In 1981, Rubin introduced the Bayesian bootstrap, an alternative to the classical bootstrap, which can be seen as a predictive inferential method: it effectively assumes an exchangeable model where the “prior” is that the $n$ observed data points are a finite population from which future samples are drawn uniformly at random. The Bayesian bootstrap’s posterior predictive for a new observation is simply the empirical distribution of the observed sample (with random weights), which aligns with de Finetti’s view of directly assigning probabilities to future data without a parametric likelihood. Ghosh and Meeden (Ghosh and Meeden 1986; Ghosh 2021) further developed Bayesian predictive methods for finite population sampling, treating the unknown finite population values as exchangeable and focusing on predicting the unseen units – again, no explicit parametric likelihood was needed. These works kept alive the notion that Bayesian inference “a la de Finetti” – with predictions first – could be practically useful. However, at the time, mainstream Bayesian statistics still largely centred on parametric models and priors, so the predictive approach was a somewhat heterodox perspective, championed by a subset of Bayesian thinkers.

1990s – The Italian School and Generalized Exchangeability: The 1990s saw renewed theoretical interest in characterizing exchangeable structures via predictions. Partial exchangeability (where data have subgroup invariances, like Markov exchangeability or other structured dependence) became a focus. In 1995, Jim Pitman generalized the Pólya urn to a two-parameter family (the Pitman–Yor process), broadening the class of predictive rules to capture power-law behavior in frequencies (Pitman 1995). In Italy, scholars like Eugenio Regazzini, Pietro Muliere, and their collaborators began exploring reinforced urn processes and other predictive constructions for more complex sequences. For example, Pietro Muliere and Petrone (1993) applied predictive mixtures of Dirichlet processes in regression problems, and P. Muliere, Secchi, and Walker (2000) introduced reinforced urn models for survival data. These models were essentially Markov chains whose transition probabilities update with reinforcement (i.e. past observations feed back into future transition probabilities), and they showed such sequences are mixtures of Markov chains – a type of partially exchangeable structure. Throughout, the strategy was to start by positing a plausible form for the one-step predictive distribution and then deduce the existence and form of the underlying probability law or “prior.” This reversed the conventional approach: instead of specifying a prior then deriving predictions, we specify predictions and thereby define an implicit prior. By the end of the 90s, the groundwork was laid for a systematic predictive construction of Bayesian models.
2000s – Predictive Characterizations and New Priors: In 2000, a landmark paper (Fortini, Ladelli, and Regazzini 2000) formalized the conditions for a predictive rule to yield exchangeability. They gave precise necessary and sufficient conditions on a sequence of conditional distributions $\{P_n(\cdot \mid x_{1:n})\}$ such that there exists some exchangeable joint law producing them. In essence, they proved that symmetry (the predictive probabilities depend on data only through symmetric functions like counts) and a certain consistency (related to associative conditioning of future predictions) characterize exchangeability. This result (along with earlier work by Diaconis and Freedman on sufficiency) provided a rigorous predictive criterion: we can validate if a proposed prediction rule is coherent (comes from some exchangeable model) without explicitly constructing the latent parameter. Around the same time, new priors in BNP were being defined via predictive structures. For instance, the species sampling models (Pitman and others) were recognized as those exchangeable sequences whose predictive distribution is a weighted mixture of point masses at the distinct values seen so far plus a draw from a base measure, with weights depending only on the cluster counts – a general form that yields various generalizations of the Dirichlet process. The Italian school played a leading role: they worked out how popular nonparametric priors like Dirichlet processes, Pitman–Yor processes, and others can be derived from a sequence of predictive probabilities. Priors by prediction became a theme. Fortini and Petrone (2012) wrote a comprehensive review on predictive construction of priors for both exchangeable and partially exchangeable scenarios. They highlighted theoretical connections and revisited classical results “to shed light on theoretical connections” among predictive constructions. By the end of the 2000s, it was clear that we could either start with a prior or directly with a predictive mechanism – the two routes were provably equivalent if done consistently, but the predictive route often yielded new insights.

2010s – Consolidation and Wider Adoption: In the 2010s, the predictive approach gained broader recognition and was increasingly connected to modern statistical learning. Fortini and Petrone continued to publish a series of works extending the theory: they explored predictive sufficiency (identifying what summary of data preserves all information for predicting new data), and they characterized a range of complex priors via predictive rules (from hierarchical priors to hidden Markov models built on predictive constructions). For example, they showed how an infinite Hidden Markov Model (used in machine learning for clustering time series) can be seen as a mixture of Markov chains, constructed by a sequence of predictive transition distributions. Meanwhile, machine learning researchers, notably in the topic modelling and Bayesian nonparametric clustering communities, adopted the language of exchangeable partitions (the Chinese restaurant process, Indian buffet process, etc., all essentially predictive rules). The review article Fortini and Petrone (2016) distilled the philosophy and noted how the predictive approach had become central both to Bayesian foundations and to practical modelling in nonparametrics and ML. Another development was the exploration of conditionally identically distributed (CID) sequences (weaker than full exchangeability) and other relaxations – these allow some trend or covariate effects while retaining a predictive structure. Researchers like Berti contributed here, defining models where only a subset of predictive probabilities are constrained by symmetry (Berti, Pratelli, and Rigo 2004, 2012). All these efforts reinforced that de Finetti’s perspective is not just philosophical musing – it leads to concrete new models and methods.
2020s – Martingale Posteriors and Prior-Free Bayesianism: Very recent years have witnessed a surge of interest in prior-free or prediction-driven Bayesian updating rules. Two parallel lines of work – one by Fong, Holmes, and Walker in the UK, and another by Berti, Rigo, and collaborators in Italy – have pushed the predictive approach to its logical extreme: conduct Bayesian inference entirely through predictive distributions, with no explicit prior at all. Edwin Fong’s D.Phil. thesis (C. H. E. Fong 2021) and subsequent papers introduced the Martingale Posterior framework. The core idea is to view the “parameter” as the infinite sequence of future (or missing) observations. If we had the entire population or the entire infinite sequence $x_{1:\infty}$, any parameter of interest (like the true mean, or the underlying distribution $F$) would be known exactly. Thus uncertainty about $\theta$ is really uncertainty about the as-yet-unseen data $x_{n+1:\infty}$. Fong et al. formalize this by directly assigning a joint predictive distribution for all future observations given the observed $x_{1:n}$. In notation, instead of a posterior $\pi(\theta \mid x_{1:n})$, they consider $p(x_{n+1:\infty} \mid x_{1:n})$. This is a huge distribution (over an infinite sequence), but under exchangeability it encodes the same information as a posterior on $\theta$. In fact, there is a one-to-one correspondence: if we choose the predictive distribution in the standard Bayesian way (by integrating the likelihood against a prior), then Doob’s martingale theorem implies the induced distribution on $\theta$ is exactly the usual posterior. Fong and colleagues instead relax this: they allow the user to specify any predictive mechanism (any sequence of one-step-ahead predictive densities) that seems reasonable for the problem, not necessarily derived from a likelihood-prior pair. As long as these predictive densities are coherent (a martingale in the sense of not contradicting themselves over time), we can define an implicit “posterior” for $\theta$ or for any function of the unseen data. They dub this the martingale posterior distribution, which “returns Bayesian uncertainty directly on any statistic of interest without the need for the likelihood and prior”. In practice, they introduce an algorithm called predictive resampling to draw samples from the martingale posterior. Essentially, we iteratively sample pseudo-future observations from the chosen predictive rule to impute an entire fake “completion” of the data, use that to compute the statistic of interest, and repeat – thereby approximating the distribution of that statistic under the assumed predictive model. Martingale posteriors generalize Bayesian inference, subsuming standard posteriors when the predictive comes from a usual model, but also allowing robust or model-misspecified settings to be handled by choosing an alternative predictive (e.g. we might choose a heavy-tailed predictive density to guard against outliers, implicitly yielding a different “posterior”).

In parallel, Berti et al. (2023) developed a similar idea of Bayesian predictive inference without a prior. They work axiomatically with a user-specified sequence of predictives $(P_n)_{n \ge 0}$ and establish general results for consistency and asymptotics of the resulting inference. One main advantage, as they note, is “no prior probability has to be selected” – the only inputs are the data and the predictive rule. These cutting-edge developments show how de Finetti’s viewpoint – once considered philosophically radical – is now driving methodological innovation for large-scale and robust Bayesian analysis. Today, the predictive approach is not only a cornerstone of Bayesian foundations but also an active area of research in its own right, influencing topics from machine learning (e.g. sequence modelling and meta-learning) to the theory of Bayesian asymptotics.