The likelihood principle
2016-05-30 — 2023-07-23
Placeholder.
Yuling Yao, The likelihood principle in model check and model evaluation:
“We are (only) interested in estimating an unknown parameter \(\theta\), and there are two data-generating experiments both involving \(\theta\) with observable outcomes \(y_1\) and \(y_2\) and likelihoods \(p_1\left(y_1 \mid \theta\right)\) and \(p_2\left(y_2 \mid \theta\right)\). If the outcome-experiment pair satisfies \(p_1\left(y_1 \mid \theta\right) \propto p_2\left(y_2 \mid \theta\right)\) (viewed as a function of \(\theta\)), then these two experiments and two observations will provide the same amount of information about \(\theta\).”
This idea seems to be useful in thinking about M-open, M-complete, M-closed problems.
1 An Introduction to the Likelihood Principle
In statistical inference, we are often faced with the same fundamental problem: we want to learn about some unknown aspect of a process — often represented by a parameter \(\theta\) — from data that the process generates. The likelihood principle offers a radical but powerful guideline for how those inferences should be made.
Suppose there are two different experiments, each designed to learn about the same parameter \(\theta\). The first experiment yields outcome \(y_1\), with likelihood function
\[ p_1(y_1 \mid \theta), \]
and the second yields outcome \(y_2\), with likelihood function
\[ p_2(y_2 \mid \theta). \]
The Likelihood Principle states:
If \(p_1(y_1 \mid \theta)\) and \(p_2(y_2 \mid \theta)\) are proportional as functions of \(\theta\), then the two data–experiment pairs provide exactly the same information about \(\theta\).
In other words, once we’ve observed data and written down the likelihood function, that function contains all the information about the parameter. How the data could have come out differently — the “stopping rule,” unused data, or other aspects of the design — does not affect our inference.
Traditionally, the likelihood principle is expressed in terms of parameters. But many modern statistical tasks are not about estimating a fixed but unknown \(\theta\); they are about prediction. In predictive inference, the focus is on the posterior predictive distribution:
\[ p(y_{\text{new}} \mid y) = \int p(y_{\text{new}} \mid \theta)\, p(\theta \mid y)\, d\theta. \]
Here, the principle carries over naturally: once we know the predictive distribution \(p(y_{\text{new}} \mid y)\), nothing else about the data collection process matters for making predictions about future outcomes. Two experiments that yield the same predictive distribution should lead to the same predictive conclusions, regardless of how the data were generated.
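To make the integral concrete, here is a minimal sketch (assuming, purely for illustration, a Beta–Bernoulli model and a handful of binary observations): the predictive probability is approximated by averaging \(p(y_{\text{new}} \mid \theta)\) over posterior draws.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Beta-Bernoulli setup: y is a vector of 0/1 outcomes.
y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1])  # 9 ones out of 12
a, b = 1.0, 1.0                                       # Beta(1, 1) prior

# By conjugacy the posterior is Beta(a + sum(y), b + n - sum(y)).
theta_draws = rng.beta(a + y.sum(), b + len(y) - y.sum(), size=100_000)

# Monte Carlo version of p(y_new | y) = ∫ p(y_new | θ) p(θ | y) dθ,
# where p(y_new = 1 | θ) = θ.
p_next_is_one = theta_draws.mean()
print(f"P(next outcome = 1 | data) ≈ {p_next_is_one:.3f}")
# Closed form for comparison: (a + 9) / (a + b + 12) ≈ 0.714
```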
This principle can feel counterintuitive, and has been controversial. Classical frequentist methods often condition on the experimental design, even considering hypothetical outcomes that did not occur (e.g. in computing p-values). The likelihood principle, in contrast, insists that only the realized data matter. This difference explains many of the philosophical and practical debates between Bayesian and frequentist approaches.
1.1 Simple Examples
Coin flips with different stopping rules
- Experiment A: Flip a coin 12 times; we observe 9 heads and 3 tails.
- Experiment B: Flip a coin until we see 9 heads; it takes 12 flips.
- The data look different: one is “9 heads out of 12,” the other is “took 12 flips to reach 9 heads.”
- But the resulting likelihood functions for \(\theta\), the probability of heads, are proportional: both are proportional to \(\theta^9 (1-\theta)^3\).
- By the likelihood principle, the evidence about \(\theta\) is the same in both cases, even though a frequentist test might give different \(p\)-values depending on the stopping rule.
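As a quick numerical check (a sketch using scipy; the grid of \(\theta\) values is arbitrary), the two likelihood functions differ only by a constant factor:

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.05, 0.95, 10)      # arbitrary grid of candidate θ values

# Experiment A: fixed n = 12 flips, 9 heads observed.
lik_binom = stats.binom.pmf(9, 12, theta)

# Experiment B: flip until the 9th head. scipy's nbinom counts the
# failures (tails) before the 9th success, here 3.
lik_nbinom = stats.nbinom.pmf(3, 9, theta)

# The ratio is constant in θ (it equals C(12,9)/C(11,3) = 4/3),
# so the two likelihood functions are proportional.
print(lik_binom / lik_nbinom)
```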
Predicting the next trial
Imagine we’re modeling the probability that the next customer clicks an ad. In one dataset we ran the ad 20 times and saw 5 clicks; in another, we ran it until we observed 5 clicks and it happened after 20 views. The two situations yield proportional likelihood functions for the click probability \(\theta\). Therefore, the posterior predictive distribution for the next click — essentially, our forecast for the next trial — is identical under either data-generating scheme. According to the likelihood principle, our predictive inferences should not depend on whether we stopped after 20 trials or after 5 successes.
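The same check works for the predictive claim. A sketch (assuming a flat Beta(1, 1) prior, chosen only for illustration) computes the posterior on a grid under each sampling model and then the implied probability that the next viewer clicks:

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.001, 0.999, 2000)   # uniform grid over the click probability
prior = stats.beta.pdf(theta, 1, 1)       # flat Beta(1, 1) prior, for illustration

def predictive_click_prob(likelihood):
    """P(next click | data) = E[θ | data], computed on the uniform grid."""
    post = likelihood * prior
    post /= post.sum()                    # normalize (grid spacing cancels)
    return float(np.sum(theta * post))

# Design 1: fixed 20 views, 5 clicks observed (binomial).
lik_fixed_n = stats.binom.pmf(5, 20, theta)

# Design 2: run until 5 clicks; 15 non-clicks observed first (negative binomial).
lik_fixed_k = stats.nbinom.pmf(15, 5, theta)

print(predictive_click_prob(lik_fixed_n))  # ≈ 6/22 ≈ 0.273
print(predictive_click_prob(lik_fixed_k))  # same value: the likelihoods are proportional
```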
1.2 Under mis-specification
The elegance of the likelihood principle rests on the assumption that the model family we are using actually contains the “true” data-generating process (the M-closed world). In practice, we are usually in the M-open setting: the true mechanism is unknown and almost certainly not exactly captured by our chosen model.
In this misspecified world, proportional likelihoods no longer guarantee equivalent information. Two different experiments can yield proportional likelihood functions for \(\theta\), but if the model is wrong, the two designs may differ in predictive adequacy, that is, in how badly the assumed model misses the true process. For example:
1.2.1 Well-specified case
Suppose we want to learn about the probability of heads \(\theta\) for a coin.
Experiment A (fixed \(n\)): Flip the coin \(n=12\) times. We observe \(y_1 = 9\) heads. The likelihood is
\[ p_1(y_1 \mid \theta) \propto \theta^9 (1-\theta)^3. \]
Experiment B (fixed number of successes): Flip the coin until we see 9 heads. It takes \(n=12\) flips. The likelihood is
\[ p_2(y_2 \mid \theta) \propto \theta^9 (1-\theta)^3. \]
Even though the experiments are designed differently, the likelihood functions are proportional. According to the likelihood principle, both experiments provide the same information about \(\theta\). Any inferential procedure that treats them differently (for instance, giving different \(p\)-values) is, from this perspective, ignoring relevant data and considering irrelevant counterfactuals.
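The p-value remark can be made concrete. The sketch below (using scipy) follows the textbook variant of this example, often attributed to Lindley and Phillips, in which the second design stops at the 3rd tail rather than the 9th head: the likelihood kernel is \(\theta^9(1-\theta)^3\) in both cases, yet the one-sided p-values for \(H_0\colon \theta = 0.5\) versus \(\theta > 0.5\) disagree.

```python
from scipy import stats

# One-sided test of H0: θ = 0.5 versus θ > 0.5, data: 9 heads and 3 tails.

# Fixed-n design: p-value = P(at least 9 heads in 12 flips | θ = 0.5).
p_fixed_n = stats.binom.sf(8, 12, 0.5)          # ≈ 0.073

# Stop-at-3rd-tail design: p-value = P(at least 9 heads before the 3rd tail | θ = 0.5).
# nbinom counts heads ("failures") before the 3rd tail ("success", prob 0.5).
p_stop_at_3_tails = stats.nbinom.sf(8, 3, 0.5)  # ≈ 0.033

print(p_fixed_n, p_stop_at_3_tails)
```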
1.2.2 Mis-specified case
Now suppose the coin isn’t really i.i.d. with a fixed \(\theta\); we are in the M-open setting. Maybe the coin is bent slightly more after each flip, so the chance of heads increases over time. Or maybe our assistant unconsciously flips harder as they get impatient. In other words, the true data-generating process isn’t actually binomial or negative binomial.
Now what happens?
- Under Experiment A (12 flips), our data are “9 heads out of 12,” order ignored. That discards information about the sequence of outcomes: for instance, perhaps all 3 tails came first, followed by a run of 9 heads.
- Under Experiment B (stop at 9 heads), the order and stopping rule are part of the data: it took exactly 12 flips to reach 9 heads, which tells us that 3 of the first 11 flips were tails. That detail is crucial if the coin’s bias was drifting.
In this M-open setting, the two datasets no longer carry equivalent information about the true (but unknown) process, because the model under which their likelihoods are proportional (a fixed-\(\theta\) Bernoulli) is wrong. The stopping rule now changes what information we actually have about how the system behaves, especially for prediction. If the probability of heads is increasing over time, then the fact that the tails fell somewhere among the earlier flips under Experiment B is informative about the current (larger) chance of heads, while Experiment A would throw that order information away.
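A small simulation sketch can illustrate the point (the linear drift, the 4-flip window, and the squared-error metric below are arbitrary choices, not anything implied by the example above): when the heads probability drifts upward, an order-blind summary of the counts tracks the next flip’s probability worse than a summary that favors recent flips.

```python
import numpy as np

rng = np.random.default_rng(1)

def drifting_coin(n_flips, p_start=0.3, p_end=0.9):
    """Simulate a coin whose heads probability drifts linearly upward."""
    probs = np.linspace(p_start, p_end, n_flips + 1)  # last entry = prob of the *next* flip
    flips = (rng.random(n_flips) < probs[:-1]).astype(int)
    return flips, probs[-1]

err_count_only, err_recent = [], []
for _ in range(5000):
    flips, p_next = drifting_coin(12)
    count_only = flips.mean()      # what a fixed-θ model sees: "k heads out of 12"
    recent = flips[-4:].mean()     # crude order-aware summary: last 4 flips only
    err_count_only.append((count_only - p_next) ** 2)
    err_recent.append((recent - p_next) ** 2)

print("MSE of count-only estimate: ", round(float(np.mean(err_count_only)), 3))
print("MSE of recent-flips estimate:", round(float(np.mean(err_recent)), 3))
# With an upward drift, the recent-flips estimate is typically closer to the
# probability governing the next flip.
```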
- In the M-closed idealization (where one of our models is literally true), the likelihood principle is compelling: proportional likelihoods mean equivalent evidence.
- In the M-open world (real life), proportionality relative to a misspecified model can hide differences. The data-generation mechanism matters because it affects how well the model aligns with reality.